Hi,

I deployed cgroups for HTCondor, and I'm having issues.

First: it looks like I have to restart condor, not just run condor_reconfig (right?). Without a service restart, I get log messages about 'cgroups not initialized'.

The problem is that even after restarting condor, the ProcLog shows these errors:

    03/18/16 20:59:53 : Unable to read cgroup htcondor/condor_home_condor_slot1_2@wn296 cpuacct stats (ProcFamily 3278233): Cgroup invalid operation.
    03/18/16 20:59:53 : Internal cgroup error when retrieving CPU statistics: Cgroup invalid operation
    03/18/16 20:59:53 : Unable to read cgroup htcondor/condor_home_condor_slot1_2@wn296 memory stats (ProcFamily 3278233): 50016 No such file or directory.

I am deploying cgroups so that memory accounting (RSS) doesn't double- or triple-count memory, which in the end causes jobs to be killed after they are reported as consuming more memory than requested. How can I fix this?

To answer the questions that are likely to come up:

- Yes, cgconfig is running, and cgroups are mounted (but not visible with the mount command; sl6x/sl6.7 here):

    [root@wn296 htcondor]# service cgconfig status
    Running
    [root@wn296 htcondor]# grep cgroup /proc/mounts
    cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
    cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
    cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
    cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
    cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
    cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
    cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0

- Yes, I see htcondor subdirectories, and I even see PIDs in the tasks files of those subdirectories:

    [root@wn296 htcondor]# wc -l /cgroup/memory/htcondor/condor_home_condor_slot1_*/tasks|sed -r -e 's%(wn...).*/%\1/%'
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_10@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_11@wn296/tasks
      6 /cgroup/memory/htcondor/condor_home_condor_slot1_1@wn296/tasks
     22 /cgroup/memory/htcondor/condor_home_condor_slot1_2@wn296/tasks
     22 /cgroup/memory/htcondor/condor_home_condor_slot1_3@wn296/tasks
     22 /cgroup/memory/htcondor/condor_home_condor_slot1_4@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_5@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_6@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_7@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_8@wn296/tasks
      0 /cgroup/memory/htcondor/condor_home_condor_slot1_9@wn296/tasks
     72 total

- Cgroups config:

    [root@wn296 htcondor]# cat /etc/cgconfig.{conf,d/htcondor.conf}
    # This file is being maintained by Puppet.
    # DO NOT EDIT
    mount {
        cpu = /cgroup/cpu;
        cpuset = /cgroup/cpuset;
        cpuacct = /cgroup/cpuacct;
        devices = /cgroup/devices;
        memory = /cgroup/memory;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio = /cgroup/blkio;
    }
    group htcondor {
        cpu {}
        cpuacct {}
        memory {}
        freezer {}
        blkio {}
    }
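Since the errors above complain about reading the cpuacct and memory stats, one more check that may be worth scripting is whether the per-slot stat files are actually readable. The following is only a rough sketch: `check_slot_stats` is a hypothetical helper name (not an HTCondor or libcgroup tool), and it assumes the cgroup v1 layout shown above, where each controller is mounted under `/cgroup/<controller>` and exposes `memory.stat` or `cpuacct.stat` in every group directory.

```shell
#!/bin/bash
# Hypothetical helper, not part of HTCondor: for every slot cgroup under the
# given group, report whether the memory and cpuacct stat files are readable.
#   $1: directory under which the controllers are mounted (assumed: /cgroup)
#   $2: group name from the cgconfig file (assumed: htcondor)
check_slot_stats() {
    local root="$1" group="$2" entry controller file dir
    for entry in memory/memory.stat cpuacct/cpuacct.stat; do
        controller="${entry%%/*}"   # e.g. "memory"
        file="${entry##*/}"         # e.g. "memory.stat"
        for dir in "$root/$controller/$group"/*/; do
            # If the glob matched nothing, $dir is the literal pattern: skip it.
            [ -d "$dir" ] || continue
            if [ -r "$dir$file" ]; then
                echo "ok      $dir$file"
            else
                echo "MISSING $dir$file"
            fi
        done
    done
}

# Paths taken from the listings above; adjust for other mount points.
check_slot_stats /cgroup htcondor
```

Any `MISSING` line for a slot that currently has PIDs in its tasks file would at least narrow down which controller the errors come from; it does not by itself explain why the file cannot be read.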
The condor version is 8.4.3-2.

The problem is that there are jobs submitted (not by me) with memory requirements, and these jobs are still being killed because of the approximate RSS accounting done without cgroups. For now, the killing is still going on :'(

Any ideas about what's wrong and, better, how to fix it? I admit I'm very new to cgroups, so I may well be mistaken somewhere.

Thanks,
Frederic