[HTCondor-users] Out of memory killer & cgroups
- Date: Thu, 2 Oct 2014 10:23:56 +0000
- From: <andrew.lahiff@xxxxxxxxxx>
- Subject: [HTCondor-users] Out of memory killer & cgroups
Hi,
When cgroups are enabled and the soft memory limit is used, i.e.
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
and a job uses so much memory that the system runs out of memory, the OOM killer kills the job:
Oct 2 11:04:15 lcg1077 kernel: Out of memory: Kill process 23856 (condor_exec.exe) score 270 or sacrifice child
Oct 2 11:04:15 lcg1077 kernel: Killed process 23856, UID 99, (condor_exec.exe) total-vm:8433308kB, anon-rss:3961620kB, file-rss:28kB
but it's not at all obvious to the user that this has happened. All that can be seen in the job's ClassAd is:
ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
A number of tickets seem to suggest that such jobs should be held with a message saying that the job has exceeded its memory limit (note that in the tests I've done I've had request_memory=1000 with jobs that use much more memory than this).
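In the meantime, one way to confirm after the fact that a "died on signal 9" exit was really an OOM kill is to scan the worker node's syslog for the kernel's "Killed process" lines and compare the reported anon-rss with the job's request_memory. The following is only a rough sketch: the regular expression assumes the exact message format shown above (other kernel versions may word it differently), the function names are hypothetical, and request_memory is assumed to be in MiB.

```python
import re

# Assumed format: the "Killed process" line exactly as printed by the
# 2.6.32 (SL6) kernel in the log excerpt above. Other kernels may differ.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+), UID (?P<uid>\d+), \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, "
    r"anon-rss:(?P<anon_rss_kb>\d+)kB, "
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per 'Killed process' line found in the given lines."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match is None:
            continue
        record = {}
        for key, value in match.groupdict().items():
            # Everything except the command name is numeric.
            record[key] = value if key == "comm" else int(value)
        kills.append(record)
    return kills


def exceeded_request(kill, request_memory_mib):
    """True if the killed process's anon-rss was above request_memory.

    Assumes request_memory is expressed in MiB.
    """
    return kill["anon_rss_kb"] / 1024.0 > request_memory_mib


# The two kernel messages quoted above, used here as sample input.
SAMPLE_LOG = [
    "Oct 2 11:04:15 lcg1077 kernel: Out of memory: Kill process 23856 "
    "(condor_exec.exe) score 270 or sacrifice child",
    "Oct 2 11:04:15 lcg1077 kernel: Killed process 23856, UID 99, "
    "(condor_exec.exe) total-vm:8433308kB, anon-rss:3961620kB, file-rss:28kB",
]

if __name__ == "__main__":
    for kill in find_oom_kills(SAMPLE_LOG):
        print(kill["pid"], kill["comm"], exceeded_request(kill, 1000))
```

This only identifies the killed PID and its memory footprint; matching that PID back to a particular job would still have to be done by hand, for example via the StarterLog on the worker node.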
This is with HTCondor 8.2.2 on an SL6.4 machine with kernel 2.6.32-431.23.3.el6.
Is what I'm seeing the expected behaviour?
Many Thanks,
Andrew.