
[HTCondor-users] cgroups and OOM killers



Hello HTCondor experts,

We're seeing some interesting behaviour with user jobs on our local HTCondor cluster, running version 9.8.
Basically, if a job goes far enough over its memory limit that its cgroup
can no longer allocate the accounted memory needed for basic functioning
of the system as a whole (e.g. to hold its cmdline), then that one cgroup
affects the whole machine and can bring it down. That is a worse outcome
than condor not being able to fully recover the status/failure reason for
a single container. And since oom_kill_disable is set to 1 on the
condor-managed cgroups, the kernel will not intervene, so the entire
system grinds to a halt. We would much rather lose state for a single
job, let the kernel do its thing, and have the system survive.
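
For reference, the flag is easy to check on an execute node; with the
cgroup v1 layout we have (paths may differ with local configuration),
every condor-managed job cgroup reports oom_kill_disable 1:

    grep oom_kill_disable /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control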
For now, the only workaround is to keep re-running

    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop, to ensure the sysadmin-intended settings are applied to the
condor-managed cgroups.
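
In case it's useful to anyone else, below is roughly what we run
periodically as root (from cron, or equally a simple while/sleep loop;
the wrapper script and the interval are just our own choice, nothing
condor provides):

    #!/bin/bash
    # Re-enable the kernel OOM killer in all condor-managed cgroups.
    # HTCondor re-creates the cgroups (and the flag) for every new job,
    # so this has to be re-applied periodically.
    for f in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do
        [ -e "$f" ] && echo 0 > "$f"
    done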
Is there a configuration knob to keep oom_kill_disable at 0? Shouldn't
this be an option, or was there another reason for oom_kill_disable
being set to 1?
Thanks,

Mary