Hi all,

is there a way in Condor to tune the memory limits for the jobs' cgroups in a more fine-grained way?

The background is that a user just managed to swap a few of our nodes to death (e.g., see the attached stats). For memory handling we have so far been running with soft limits, i.e.,

  CGROUP_MEMORY_LIMIT_POLICY = soft

which, as far as I can see, is reflected in the job slices' memory.max_usage_in_bytes as well as in the startd log [1]. Since the hard and the mem+swap limits are pretty generous, I suppose they will never take effect.

Also, if I understand memory.oom_control correctly, the out-of-memory handling is actually not done by the kernel but by Condor [2], right? I guess this is so that Condor can clean up a job itself?

Since the OOM situation on the affected nodes became serious pretty rapidly, I wonder if we can make the memory control stricter but still allow for a soft over-allocation. E.g., per-job hard limits for mem / memsw as multiples of soft_limit_in_bytes, but still below the total mem / total mem + X.

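To make the idea concrete, here is a minimal sketch of the arithmetic I have in mind, using the figures from the startd log in [1]. The helper proposed_limits, the factor 1.5 and the 2 GiB swap headroom "X" are purely illustrative assumptions; this is not an existing Condor knob or API.

```python
# Hypothetical helper, not an HTCondor API: derive per-job cgroup limits
# from the soft limit, capped by the node's physical memory.

def proposed_limits(soft_bytes, mem_total_bytes, hard_factor=1.5, swap_headroom_bytes=0):
    """Return (hard, memsw) limits in bytes.

    hard  = hard_factor * soft, but never more than total memory
    memsw = hard + swap headroom "X" (so at most total memory + X)
    """
    if hard_factor < 1:
        raise ValueError("hard_factor must be >= 1 so hard is not below soft")
    hard = min(int(soft_bytes * hard_factor), mem_total_bytes)
    memsw = hard + swap_headroom_bytes
    return hard, memsw


# Figures from the startd log [1]; MemTotal is taken as units of 1024 bytes.
soft = 21072183296
mem_total = 263944848 * 1024

# Assumed example values: factor 1.5 and X = 2 GiB of swap headroom.
hard, memsw = proposed_limits(soft, mem_total, hard_factor=1.5,
                              swap_headroom_bytes=2 * 1024**3)
print(f"soft  = {soft} bytes")
print(f"hard  = {hard} bytes")
print(f"memsw = {memsw} bytes")
```

With limits like these, a single runaway job would hit its own hard limit long before the node as a whole runs out of memory.
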
For the moment I am trying to limit the overall memory of the Condor unit's slice as a safeguard to keep the node responsive, obviously at the price that all jobs/slices below it get affected indiscriminately when a single job pushes the whole Condor slice into the limit :(

Cheers,
  Thomas

[1]
07/30/18 07:15:55 (pid:41725) Running job as user cmsplt036
07/30/18 07:15:55 (pid:41725) Create_Process succeeded, pid=41745
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 0 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 9223372036854775807 bytes
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 21072183296 bytes
07/30/18 07:15:55 (pid:41725) Limiting (hard) memory usage to 278668124160 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 278668128256 bytes

where
  MemTotal: 263944848 kB
i.e., the hard mem and memsw limits are both set about 8 GB above the total memory (reading the kB of MemTotal as 1024 bytes)

[2]
> cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx/memory.oom_control
oom_kill_disable 1
under_oom 0
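
As a quick sanity check of the headroom quoted in [1], the following sketch compares the hard limit against MemTotal. It assumes the MemTotal value was taken from /proc/meminfo, which reports units of 1024 bytes although it labels them kB; the 1000-byte reading is shown only for comparison. The function name headroom_bytes is made up for this example.

```python
# Values copied from the startd log and MemTotal line in [1].
HARD_LIMIT_BYTES = 278668124160
MEMSW_LIMIT_BYTES = 278668128256
MEM_TOTAL_KB = 263944848


def headroom_bytes(limit_bytes, mem_total_kb, bytes_per_kb):
    """Bytes by which a limit exceeds total memory for a given kB size."""
    return limit_bytes - mem_total_kb * bytes_per_kb


# /proc/meminfo uses 1024 bytes per "kB"; 1000 is listed for comparison only.
for bytes_per_kb in (1024, 1000):
    extra = headroom_bytes(HARD_LIMIT_BYTES, MEM_TOTAL_KB, bytes_per_kb)
    print(f"kB = {bytes_per_kb} bytes: hard limit exceeds MemTotal by {extra / 1e9:.1f} GB")

# The memsw limit sits just above the hard limit.
print(f"memsw - hard = {MEMSW_LIMIT_BYTES - HARD_LIMIT_BYTES} bytes")
```

Either way, both limits lie above the physical memory of the node, so a job can push the machine deep into swap before any of them triggers.
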
Attachment: batch0946_load.png (PNG image)
Attachment: batch0946_mem.png (PNG image)