Hi,

Condor 8.4.11 here. I am facing jobs that are put into the hold state much too quickly. In the startd log I see this kind of thing:

03/06/17 04:28:26 (pid:1158843) Limiting (soft) memory usage to 16777216000 bytes
03/06/17 04:28:26 (pid:1158843) Limiting (hard) memory usage to 9135382528 bytes
03/06/17 04:28:26 (pid:1158843) Unable to commit memory soft limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument
03/06/17 04:28:26 (pid:1158843) Limiting memsw usage to 9135386624 bytes
03/06/17 04:28:26 (pid:1158843) Unable to commit memsw limit for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx : 50016 Invalid argument
03/06/17 04:28:26 (pid:1158843) Unable to commit CPU shares for htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx: 50016 Invalid argument
03/06/17 05:14:44 (pid:1158843) Hold all jobs
03/06/17 05:14:44 (pid:1158843) Job was held due to OOM event: Job has gone over memory limit of 16000 megabytes.

In the condor history, I see:

MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
LastHoldReasonSubCode = 0
RequestMemory = 16000
LastHoldReasonCode = 34
ResidentSetSize = 3000000
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to being in hold state for 6 hours."
ResidentSetSize_RAW = 2966680
LastHoldReason = "Error from slot1@xxxxxxxxxxxxxxxxxxxxx: Job has gone over memory limit of 16000 megabytes."
Requirements = ( ( RequestCpus == 8 || RequestCpus == 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
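For reference, here is the quick sanity check I did on those numbers. It is plain Python, nothing Condor-specific; I am assuming ResidentSetSize / ResidentSetSize_RAW are in KiB and RequestMemory is in MiB, which is my reading of the manual and of the MemoryUsage expression above:

# Unit check on the values from the StartLog and the history above.
MiB = 1024 * 1024

soft_limit_bytes = 16777216000   # "Limiting (soft) memory usage to ..."
hard_limit_bytes = 9135382528    # "Limiting (hard) memory usage to ..."
memsw_limit_bytes = 9135386624   # "Limiting memsw usage to ..."

print(soft_limit_bytes / MiB)    # 16000.0 -> exactly RequestMemory
print(hard_limit_bytes / MiB)    # ~8712   -> well below the soft limit
print(memsw_limit_bytes / MiB)   # ~8712   -> the hard limit plus one 4 KiB page

# MemoryUsage as the expression in the history computes it,
# i.e. ResidentSetSize_RAW rounded up from KiB to MiB:
rss_raw_kib = 2966680
print((rss_raw_kib + 1023) // 1024)   # 2898 -> far below the 16000 MiB request

So, if my unit assumptions are right, the job's reported RSS stayed below 3 GB while it requested 16 GB, and the hard limit the startd tried to set on the cgroup is only about 8.7 GB.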
What I am failing to understand is the following:

- 3,000,000 KB of RSS is lower than 16,000 MB of memory, so why was the job put on hold?
- Why is the job given a memory soft limit that is higher than the hard limit? In that case only the hard limit can ever take effect, can it not?
- What is the meaning of the "Invalid argument" cgroup errors? I presume I will have to dig into these to find the real problem behind them; I have checked that the relevant kernel tuning options are enabled. (See also the small check I sketched at the end of this message.)

One more question: I understood from the various Condor docs and wikis that cgroups should allow swap to be used when a job exceeds its memory limit, but I do not see any swap being allocated, so I am wondering how this behavior can be controlled, if it can be controlled at all. Any hints?
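In case it is relevant, this is roughly how I intend to poke at the cgroup on the execute node to see which limits actually got committed. It is only a minimal sketch, assuming a cgroup v1 memory controller mounted under /sys/fs/cgroup/memory and the htcondor/... cgroup name taken from the StartLog above; the path will differ on other setups:

import os

# Assumption: cgroup v1, memory controller at /sys/fs/cgroup/memory, and the
# slot cgroup named as in the StartLog; adjust the path to your layout.
cg = "/sys/fs/cgroup/memory/htcondor/condor_home_condor_slot1_4@xxxxxxxxxxxxxxxxxxxxx"

files = (
    "memory.soft_limit_in_bytes",   # soft limit
    "memory.limit_in_bytes",        # hard limit
    "memory.memsw.limit_in_bytes",  # memory + swap limit
)

for name in files:
    path = os.path.join(cg, name)
    if os.path.exists(path):
        with open(path) as fh:
            print(name, "=", fh.read().strip())
    else:
        # The memory.memsw.* files in particular only exist when the kernel
        # has swap accounting enabled.
        print(name, "not present")

My hope is that comparing these values with the StartLog will tell me whether the "Invalid argument" errors mean nothing got committed at all, and whether swap accounting is even available for the memsw part.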
Frederic