Hi all,

nodes keep appearing where jobs are not placed into their own slice but sit directly below the main condor cgroup. E.g., one node currently has 20 jobs running but only six dedicated sub-slices in the CPU cgroup below the main condor slice [1], i.e., it looks like jobs get started without a cgroup being created for them.

Even weirder, there are slices that do not have any PIDs assigned. E.g., on this node a sub-slice was created for slot1_16 [2], but it does not look like any starter/exec was actually started inside that sub-group (it ended up under the parent condor group again, see [2.b] and [2.c]).

Side note: on some of these nodes we have (ro) bind-mounted /sys/fs/cgroup into a Singularity container. However, that should(?!) not affect any Condor process running outside this container in the root namespace (I assume...) - at least the bind-mount does not appear in the root namespace [3].

Cheers,
  Thomas

[1]
> ls -1 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/ | grep slot | wc -l
6

> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_14@xxxxxxxxxxxxxxxxx/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
53 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_19@xxxxxxxxxxxxxxxxx/tasks
18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx/tasks
 0 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_7@xxxxxxxxxxxxxxxxx/tasks
18 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_9@xxxxxxxxxxxxxxxxx/tasks
89 total

> ps axf | grep starter | grep grid-arcce | wc -l
20

> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
1078 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks

[2]
> cat /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
> echo $?
0

[2.b]
> ps axf | grep -A 5 slot1_16
...
48029 ?  Ss  0:00  \_ condor_starter -f -a slot1_16 grid-arcce0.desy.de
48034 ?  Ss  0:00  |   \_ /bin/bash -l /var/lib/condor/execute/dir_48029/condor_exec.exe
48089 ?  S   0:00  |       \_ /usr/bin/time -o /var/lib/condor/execute/dir_48029/5d0KDmvIspsnntDnJpfbFDFoABFKDmABFKDmjFbbDmABFKDmVI3j3m.diag

[2.c]
> grep 48034 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
48034
> grep 48034 /sys/fs/cgroup/cpu\,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_16@xxxxxxxxxxxxxxxxx/tasks
> echo $?
0

[3]
> findmnt | grep "\["
├─/home  /dev/sda6[/home]  ext4  rw,relatime,data=ordered
├─/tmp   /dev/sda6[/tmp]   ext4  rw,relatime,data=ordered

On 2018-06-07 20:41, Todd Tannenbaum wrote:
> On 6/7/2018 10:44 AM, Thomas Hartmann wrote:
>> Hi all,
>>
>> I just noticed that a few of our nodes have their jobs not confined in
>> cgroups - i.e., no condor slice at all [1]. These nodes are set up the
>> same and on the same release [2] as the majority of the nodes where the
>> jobs are properly cgrouped.
>> We are going to drain and reboot these nodes, but maybe somebody has an
>> idea what might have gone wrong here?
>>
>> Cheers,
>> Thomas
>>
>
> Hi Thomas,
>
> Unlike some others on this list, I am not a cgroup expert, but what does
> "condor_config_val BASE_CGROUP" have to say on these two machines?
> The default value is "htcondor", so to poke around in /sys/fs/cgroup, I
> would not be going into system.slice subdirectory (systemd settings), but
> would do something like:
>
> # ls /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot1_slot1_*
>
> Hope the above helps
> Todd
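PS: to see at a glance which cpu cgroup each running starter actually ended up in, a quick check could look like the following (just a sketch, assuming cgroup v1 as in the outputs above and reusing the "grid-arcce" pattern from the ps output; /proc/<pid>/cgroup is the generic kernel interface):

> for p in $(pgrep -f 'condor_starter.*grid-arcce'); do printf '%6s  ' "$p"; awk -F: '$2 == "cpu,cpuacct" {print $3}' /proc/$p/cgroup; done

Every PID that only prints /system.slice/condor.service instead of a .../condor_var_lib_condor_execute_slot1_* sub-group would be one of the misplaced starters.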