Hi all,

unfortunately, I have to bring up my cgroup issue again :-[

I can reproduce it: starting a systemd unit empties all condor job sub-slices of their tasks, i.e., the job cgroups get deleted and the processes are moved up to the main condor parent cgroup.

* initially, there are a few processes in the main condor cgroup [1.a], such as the condor_starters [1.b]
** in the VFS, sub-cgroups appear with tasks files containing the job PIDs [1.b], i.e.,
   > /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*@batch0311.desy.de/tasks
* in systemd-cgls only the condor.slice group is listed, with all child processes directly beneath it [2]
* adding a stripped-down unit file as /etc/systemd/system/my-htcondor-test-unit.service [3]
* reloading the units and checking the status with
   > systemctl daemon-reload
   > systemctl status my-htcondor-test-unit.service
** cgroup tasks stay unaffected
* starting the unit with
   > systemctl start my-htcondor-test-unit.service
** the job cgroups get deleted and the processes get attached to the parent group [4]
* I can reproduce this behaviour only for processes that condor itself started into their cgroups.
** when I manually create a sub-cgroup in the condor slice and attach a process to it [5], or create a separate cgroup tree, the processes attached to these groups are not affected and stay attached to their cgroups

So far, I have not found a way to debug or understand what systemd or condor are doing in detail during the service start, especially since I do not see how the condor unit and its cgroups are related to another service unit.

What makes me wonder is that (before starting the service, i.e., while everything is still fine) systemd-cgls lists all job processes directly below the condor.slice cgroup [2]. Is systemd maybe becoming aware of some "divergence", i.e., no sub-groups below condor.slice known to systemd vs. sub-cgroups created by condor in parallel(??) to systemd, and then tries to 'fix' it??
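To pin down exactly which PIDs move and when, the tasks files could be snapshotted before and after starting the unit and then compared. A minimal sketch, assuming the cgroup v1 layout from [1.b]; the function name snapshot_tasks is made up for illustration and is not part of condor or systemd:

```shell
#!/bin/sh
# snapshot_tasks ROOT
# Hypothetical helper: prints one "<cgroup path relative to ROOT> <PID>" line
# for every PID found in a "tasks" file at or below ROOT (cgroup v1 layout).
snapshot_tasks() {
    root=${1%/}
    find "$root" -type f -name tasks | sort | while read -r file; do
        dir=${file%/tasks}
        rel=${dir#"$root"}
        while read -r pid; do
            # Skip empty lines; an empty cgroup has an empty tasks file.
            if [ -n "$pid" ]; then
                echo "${rel:-/} $pid"
            fi
        done < "$file"
    done
}
```

Calling it with /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service as argument once before and once after "systemctl start my-htcondor-test-unit.service", redirecting each output to a file, and running diff on the two files should list every PID that changed its cgroup.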

Cheers and thanks for any ideas,
  Thomas

[1.a]
> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
31 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks

[1.b]
> for X in $(ls -1 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/*/tasks); do echo $X; cat $X; done
/sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_11@xxxxxxxxxxxxxxxxx/tasks
23575
23634
...
/sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/condor_var_lib_condor_execute_slot1_12@xxxxxxxxxxxxxxxxx/tasks
10887
10947
...
..

[2]
...
├─condor.service
│ ├─ 693 /bin/sh -c cd /var/lib/condor/execute/dir_13310/4hiKDmAgQrsnntDnJpfbFDFoABFKDmABFKDmABFKDml1GaDmABFKDmVunOMm/Panda_Pilot_33037_1529872420/PandaJob;export ATLAS_LOCAL_ROOT_BASE=/cvmfs
│ ├─ 1454 /bin/sh -c cd /var/lib/condor/execute/dir_13297/5mGKDm6fQrsnntDnJpfbFDFoABFKDmABFKDmKVGaDmABFKDmkyuWmn/Panda_Pilot_32654_1529872419/PandaJob;export ATLAS_LOCAL_ROOT_BASE=/cvmfs
│ ├─ 2255 /bin/bash /var/lib/condor/execute/dir_58999/glide_F46vYJ/main/condor_startup.sh glidein_config
│ ├─ 2284 python pilot3/pilot.py -h DESY-HH_UCORE -s DESY-HH_UCORE -f false -p 25443 -d {HOME} -w https://pandaserver.cern.ch
│ ├─ 2506 /bin/sh -c cd /var/lib/condor/execute/dir_13293/AG1MDm7fQrsnntDnJpfbFDFoABFKDmABFKDmgiGaDmABFKDmDnAsxn/Panda_Pilot_32209_1529872419/PandaJob;export ATLAS_LOCAL_ROOT_BASE=/cvmfs
│ ├─ 2673 condor_starter -f -a slot1_2 cmsgwms-submit5.fnal.gov
│ ├─ 2719 /usr/libexec/singularity/bin/action-suid /srv/.osgvo-user-job-wrapper.sh /srv/condor_exec.exe prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020-Sandbox.tar.
│ ├─ 2821 /bin/bash /srv/condor_exec.exe prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020-Sandbox.tar.bz2 3052266
│ ├─ 2931 python2 Startup.py
│ ├─ 3063 /bin/bash /srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh slc6_amd64_gcc481 scramv1 CMSSW CMSSW_7_1_31 FrameworkJobReport.xml cmsRun PSet.py
│ ├─ 3105 cmsRun -j FrameworkJobReport.xml PSet.py
│ ├─ 3151 /bin/bash /var/lib/condor/execute/dir_55491/glide_WHVFkw/main/condor_startup.sh glidein_config
│ ├─ 3197 /var/lib/condor/execute/dir_58999/glide_F46vYJ/main/condor/sbin/condor_master -f -pidfile /var/lib/condor/execute/dir_58999/glide_F46vYJ/condor_master2.pid
│ ├─ 3200 condor_procd -A /var/lib/condor/execute/dir_58999/glide_F46vYJ/log/procd_address -L /var/lib/condor/execute/dir_58999/glide_F46vYJ/log/ProcLog -R 1000000 -S 60 -C 40936
│ ├─ 3201 condor_startd -f
...

[3]
> cat /etc/systemd/system/my-htcondor-test-unit.service
>>>
[Unit]
Description=Stripped down systemd unit file

[Service]
ExecStart=/bin/ls
Environment="PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin"

[Install]
WantedBy=multi-user.target
>>>

[4]
> wc -l /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks
537 /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/tasks

[5]
mkdir /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/th.test/
nohup sleep 800  # $$ --> 57104
echo 57104 > /sys/fs/cgroup/cpu,cpuacct/system.slice/condor.service/th.test/tasks

[6]
* cgroup related condor settings are:
   BASE_CGROUP = /system.slice/condor.service
   CGROUP_MEMORY_LIMIT_POLICY = soft
* system is CentOS 7 on 3.10.0-693.21.1.el7.x86_64
* with installed packages
   condor-classads-8.6.11-1.el7.x86_64
   condor-external-libs-8.6.11-1.el7.x86_64
   condor-python-8.6.11-1.el7.x86_64
   condor-8.6.11-1.el7.x86_64
   condor-procd-8.6.11-1.el7.x86_64
   libcgroup-0.41-15.el7.x86_64