On 4/30/20 4:27 AM, jean-michel.barbet@xxxxxxxxxxxxxxxxx wrote:
> On 4/30/20 6:19 AM, tpdownes@xxxxxxxxx wrote:
>> I do think your problem is as simple as Thomas' question: figuring
>> out why oom_control is set to disabled. These cgroup settings are
>> inherited hierarchically so it could be the htcondor group itself or
>> a cgroup above it. It could even be set system-wide.
HTCondor intentionally sets oom_kill_disable because the starter really
needs to know if the job was OOM killed, and treat the job differently
than if it just got a normal signal 9. We think it is very unfortunate
that the OOM killer kills with the usual signal 9, and not a custom
signal just for OOM -- we wouldn't need to do this if the OOM signal was
its own value. The starter also installs a handler to get notified when
the kernel oom-kills a process in the job. This lets the starter clean
up the job, and put the job on hold with an appropriate message if it
gets OOM killed. If we didn't do this, then an OOM-killed job would be
killed with signal 9, and probably leave the queue, since from condor's
perspective, it has exited of its own accord.
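
To illustrate the ambiguity described above, here is a minimal Python sketch
(not HTCondor code). It shows that a parent process only sees "killed by
signal 9", with nothing in the exit status to say whether the kernel OOM
killer or an ordinary kill -9 sent it. The helper names describe_exit,
run_and_kill, and parse_oom_control are hypothetical, made up for this
example. The parser assumes the cgroup v1 memory.oom_control layout of
"key value" lines such as "oom_kill_disable 1" and "under_oom 0".

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Describe a child's exit the way a parent process sees it."""
    # subprocess reports death-by-signal as a negative return code.
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def run_and_kill() -> int:
    """Start a sleeping child, send it SIGKILL, and return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL is the same signal the kernel OOM killer delivers, so the
    # parent cannot tell the two cases apart from the exit status alone.
    child.send_signal(signal.SIGKILL)
    return child.wait()


def parse_oom_control(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v1 memory.oom_control file."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            fields[parts[0]] = int(parts[1])
    return fields


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
    print(parse_oom_control("oom_kill_disable 1\nunder_oom 0\n"))
```

Because the exit status carries no more detail than this, a supervisor has to
learn about the OOM event from the cgroup itself rather than from the signal.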
-greg