
Re: [HTCondor-users] condor cgroup hard setting related queries




On 3/2/21 3:52 AM, ervikrant06@xxxxxxxxx wrote:



## Test with condor 8.8.5 (Stable release) and CentOS Linux release 7.9.2009

Test 1 & Test 2: In both cases the message below was reported in the slot log file, but the job stayed in the running state indefinitely until manual action was taken.

Spurious OOM event, usage is 2, slot size is 5399 megabytes, ignoring OOM (read 8 bytes)


Hi Vikram:

I believe there were some bugs in cgroup OOM handling in older condor versions. Can you try with the setting

IGNORE_LEAF_OOM = false?
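
For concreteness, a minimal sketch of how that might look in the execute node's configuration; the CGROUP_MEMORY_LIMIT_POLICY line is only my assumption about what you already have, given the hard-limit setup you describe:

    # Presumably already set on your execute nodes (the "hard" cgroup policy from the subject)
    CGROUP_MEMORY_LIMIT_POLICY = hard
    # The setting suggested above
    IGNORE_LEAF_OOM = false

and then restart (or at least reconfig) condor on that node so the change takes effect.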

-greg



Questions:

- If we really do need to use SYSTEM_PERIODIC_HOLD alongside the cgroup hard setting, what would the right expression be for partitionable slots? (See the sketch after this list.)
- Why is the job on the CentOS 7 node either marked as completed or held despite breaching the memory limit in the same way it did in the RHEL 6 setup?
- Why are the hold reason codes completely empty in some cases, while in others they are returned successfully from the execute node?
- Is it okay to use WANT_HOLD and SYSTEM_PERIODIC_HOLD together? We are currently using WANT_HOLD to hold jobs that run longer than the stipulated time.
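
For concreteness, here is a rough sketch of how we currently picture combining the two; the specific values (memory compared in MB via MemoryUsage/RequestMemory, and a 24-hour run-time cap) are only placeholders for our real policy, not something we have validated:

    # Schedd side: hold running jobs whose measured memory exceeds what they requested
    # (MemoryUsage and RequestMemory are both expressed in megabytes)
    SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > RequestMemory)
    SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested memory"

    # Startd side: what we already use today to cap run time (24 hours here as a placeholder)
    WANT_HOLD = (TotalJobRunTime > 24 * 60 * 60)
    WANT_HOLD_REASON = "Job exceeded the maximum allowed run time"

The doubt is mainly whether something of this shape behaves sensibly for jobs running in dynamic slots carved out of partitionable slots, and whether the two mechanisms interfere with each other or with the cgroup hard limit.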



Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/