[HTCondor-users] cgroup error


I am using SL6 (kernel 2.6.32-504.8.1.el6.x86_64) and HTCondor 8.3.7 (Jul 23 2015, BuildID: 331383).

I enabled cgroups as described in the manual and submitted 'stress' jobs that use 10 GB of memory while requesting only 1 GB of memory in the submit file.

The result is not what I would expect:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20378 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:01.03 stress
20400 chbeyer   20   0  9.8g 1.0g  144 D  2.0  6.4   0:00.77 stress
20381 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:00.97 stress
20384 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.97 stress
20386 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.4   0:01.03 stress
20388 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.86 stress
20392 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.80 stress
20398 chbeyer   20   0  9.8g 1.0g  144 D  1.3  6.3   0:00.71 stress

The procd log file shows some errors: 

08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
08/25/15 15:40:26 : gathering usage data for family with root pid 20361
08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20373): Cgroup invalid operation.
08/25/15 15:40:26 : Internal cgroup error when retrieving CPU statistics: Cgroup invalid operation
08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxx memory stats (ProcFamily 20373): 50016 No such file or directory.
08/25/15 15:40:26 : PROC_FAMILY_GET_USAGE
08/25/15 15:40:26 : gathering usage data for family with root pid 20360
08/25/15 15:40:26 : Unable to read cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxx cpuacct stats (ProcFamily 20372): Cgroup invalid operation.
[ snip ]
08/25/15 13:46:41 : Setting cgroup to htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896.
08/25/15 13:46:41 : Warning - cgroup controller cpuacct not mounted (but not required).
08/25/15 13:46:41 : Warning - cgroup controller memory not mounted (but not required).
08/25/15 13:46:41 : Warning - cgroup controller freezer not mounted (but not required).
08/25/15 13:46:41 : Warning - cgroup controller blkio not mounted (but not required).
08/25/15 13:46:41 : Warning - cgroup controller cpu not mounted (but not required).
08/25/15 13:46:41 : Cannot attach pid 15896 to cgroup htcondor/condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxx for ProcFamily 15896: 50014 Cgroup not initialized
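
The "not mounted" warnings above suggest checking which cgroup controllers are actually mounted on the execute node. Below is a minimal sketch of such a check. It assumes cgroup v1 (as on SL6) and the usual /proc/mounts field layout. The function names are made up for illustration and are not part of HTCondor:

```python
# Controllers named in the procd warnings above.
WANTED = ("cpuacct", "memory", "freezer", "blkio", "cpu")


def mounted_controllers(mounts_text):
    """Return the set of WANTED cgroup v1 controllers found in
    /proc/mounts-style text."""
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cgroup":
            # Controller names appear among the mount options.
            found.update(opt for opt in fields[3].split(",") if opt in WANTED)
    return found


def missing_controllers(mounts_text):
    """Return the WANTED controllers that no cgroup mount provides."""
    mounted = mounted_controllers(mounts_text)
    return [name for name in WANTED if name not in mounted]
```

On the execute node this could be called as missing_controllers(open("/proc/mounts").read()). An empty list would mean all five controllers from the warnings are mounted.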

I thought that jobs which exceed the memory limit by this much would be killed and put on hold, but that seems to happen only from time to time (?)

best regards

/*   Christoph Beyer     |   Office: Building 2b / 23     *\
 *   DESY                |    Phone: 040-8998-2317        *
 *   - IT -              |      Fax: 040-8994-2317        *
\*   22603 Hamburg       |     http://www.desy.de         */