Re: [HTCondor-users] systemd interfering with Condor job cgroups
- Date: Sat, 30 Jun 2018 10:34:39 -0700
- From: Stuart Anderson <anderson@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] systemd interfering with Condor job cgroups
> On Jun 30, 2018, at 9:17 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>
> Hi Greg,
>
> in the end, without delegation, my steps to shred the sub-slices were
> reduced to:
> * adding a new basic unit to /etc/systemd/system (userland units and
> slices behaved differently)
> * reloading the units, aka 'systemctl daemon-reload'
> * and then starting the fresh unit (see the sketch after this quote)
>
> This worked for me, at least on CentOS 7.5.1804 with systemd-219-57.
>
> I have no idea why a unit restart does not reliably trigger the
> behaviour but only a new unit does; it should look much the same to
> systemd?? Probably only Poettering knows... ;)
>
> Interestingly, systemd-cgls showed systemd's cgroup view during all
> steps, i.e., ignoring(??) the kernel's cgroup hierarchy.
> Anyway, with delegation on, the sub-slices are still not known to
> systemd-cgls and all PIDs hang below the condor slice -- but at least
> systemd no longer cares about them.
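(For concreteness, the quoted sequence might look like the following sketch; the unit name condor-test.slice and its contents are illustrative, not the actual unit Thomas used:)

    # create a fresh slice unit (hypothetical name and contents)
    cat > /etc/systemd/system/condor-test.slice <<'EOF'
    [Unit]
    Description=Fresh slice for HTCondor job cgroups

    [Slice]
    CPUAccounting=yes
    MemoryAccounting=yes
    EOF
    systemctl daemon-reload            # pick up the new unit file
    systemctl start condor-test.slice  # start the fresh unit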
Thomas, Greg,
With Delegate=yes on an SL7.5 system I also see systemd-cgls apparently showing all of the condor dynamic slot processes as if they were running directly in the condor.service cgroup. However, systemd-cgtop shows the expected hierarchy.
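For reference, delegation can be turned on with a unit drop-in along these lines (a sketch; the drop-in file name is arbitrary):

    # hypothetical drop-in enabling delegation for condor.service
    mkdir -p /etc/systemd/system/condor.service.d
    cat > /etc/systemd/system/condor.service.d/delegate.conf <<'EOF'
    [Service]
    # leave cgroups below condor.service's own cgroup to condor itself
    Delegate=yes
    EOF
    systemctl daemon-reload
    systemctl restart condor

Here is what the two tools show: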
[root@node775 ~]# systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
├─user.slice
│ └─user-0.slice
│   └─session-190.scope
│     ├─57619 sshd: root@pts/0
│     ├─57621 -bash
│     ├─58402 systemd-cgls
│     └─58403 less
└─system.slice
  ├─sm-client.service
  │ └─18092 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueu
  ├─sendmail.service
  │ └─18072 sendmail: accepting connection
  ├─gmond.service
  │ └─7802 /usr/sbin/gmond
  ├─crond.service
  │ └─3467 /usr/sbin/crond -n
  ├─atd.service
  │ └─3462 /usr/sbin/atd -f
  ├─autofs.service
  │ └─3453 /usr/sbin/automount --pid-file /run/autofs.pid
  ├─rpcbind.service
  │ └─3310 /sbin/rpcbind -w
  ├─rpc-statd.service
  │ └─3306 /usr/sbin/rpc.statd
  ├─condor.service
  │ ├─ 3001 /usr/sbin/condor_master -f
  │ ├─ 8415 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 985
  │ ├─ 8450 condor_shared_port -f
  │ ├─ 9011 condor_startd -f
  │ ├─ 9032 condor_ckpt_server
  │ ├─24234 condor_starter -f -a slot1_11 pcdev2.ldas.cit
  │ ├─24238 bash /home/cbc/pe/local/bin/lalinference_mpi_wrapper --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --mpirun /ldcg/intel/2018u1/compilers_and_l
  │ ├─24239 /bin/sh /ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mpi/intel64/bin/mpirun -np 8 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx I
  │ ├─24244 mpiexec.hydra -np 8 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate
  │ ├─24245 /ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mpi/intel64/bin/pmi_proxy --control-port node775.cluster.ldas.cit:42940 --pmi-connect alltoall --pmi-aggregate -s 0 --rm
  │ ├─24249 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24250 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24251 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24252 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24253 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24254 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24255 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─24256 /home/cbc/pe/local/bin/lalinference_mcmc --a_spin2-max 0.89 --L1-flow 20 --approx IMRPhenomPv2pseudoFourPN --psdlength 128.0 --nlive 2048 --adapt-temps --srate 2048.0 --H1-cache /h
  │ ├─26278 condor_starter -f -a slot1_17 grid.ldas.cit
  │ ├─26282 python /home/gstlalcbc/modules/post_O2/O2_C02_Virgo_deglitch_Haswell_180615/opt/bin/gstlal_inspiral_calc_likelihood --likelihood-url gstlal_inspiral_marginalize_likelihood/H1L1V1-0
  │ ├─33797 condor_starter -f -a slot1_12 pcdev6.ldas.cit
  │ ├─33801 condor_exec.exe --a_spin2-max 0.05 --approx IMRPhenomPv2pseudoFourPN --H1-timeslide 0 --psdlength 128.0 --chirpmass-max 2.16847416667 --L1-spcal-phase-uncertainty 5 --distance-prio
  │ ├─35856 condor_starter -f -a slot1_1 pcdev13.ldas.cit
  │ ├─35857 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128230297 --a
  │ ├─44836 condor_starter -f -a slot1_3 pcdev5.ldas.cit
  │ ├─44837 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128282137 --a
  │ ├─46840 condor_starter -f -a slot1_8 pcdev1.ldas.cit
  │ ├─46841 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128230297 --a
  │ ├─48102 condor_starter -f -a slot1_10 pcdev5.ldas.cit
  │ ├─48103 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128247577 --a
  │ ├─48222 condor_starter -f -a slot1_5 pcdev3.ldas.cit
  │ ├─48239 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128213017 --a
  │ ├─51235 condor_starter -f -a slot1_14 pcdev5.ldas.cit
  │ ├─51236 condor_starter -f -a slot1_15 pcdev5.ldas.cit
  │ ├─51237 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128213017 --a
  │ ├─51238 /home/vincent.roma/src/lalsuite/lalapps/src/inspiral/posterior/.libs/lt-lalinference_nest --psdlength 12 --nlive 100 --nmcmc 100 --srate 4096 --seglen 3.0 --trigtime 1128213017 --a
  │ ├─57448 condor_starter -f -a slot1_16 pcdev5.ldas.cit
  │ ├─57449 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57450 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57536 condor_starter -f -a slot1_2 pcdev5.ldas.cit
  │ ├─57537 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57538 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57539 condor_starter -f -a slot1_4 pcdev5.ldas.cit
  │ ├─57540 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57543 condor_starter -f -a slot1_9 pcdev5.ldas.cit
  │ ├─57544 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57545 condor_starter -f -a slot1_13 pcdev5.ldas.cit
  │ ├─57546 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57549 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ ├─57552 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  │ └─57553 condor_exec.exe --amp-order 0 --n-eff 100 --time-marginalization --cache-file ../local.cache --n-max 2000000 --save-P 0.1 --n-chunk 4000 --fmax 2047.0 --output-file CME-SEOBNRv4T-E
  ├─boinc-client.service
  │ ├─ 7794 /usr/bin/boinc_client --daemon --start_delay 1
...
[root@node775 ~]# systemd-cgtop
Path Tasks %CPU Memory Input/s Output/s
/ 277 2303.3 119.9G - -
/system.slice - 2300.9 119.9G - -
/system.slice/condor.service 54 2297.3 114.6G - -
/system.slice/condor.service/condor_local_condor_execute_slot1_11@xxxxxxxxxxxxxxxxxxxxxxxx 12 799.7 919.5M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_13@xxxxxxxxxxxxxxxxxxxxxxxx 2 100.1 821.9M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_9@xxxxxxxxxxxxxxxxxxxxxxxx 2 100.1 822.0M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_4@xxxxxxxxxxxxxxxxxxxxxxxx 2 100.1 840.1M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_14@xxxxxxxxxxxxxxxxxxxxxxxx 1 100.0 125.5M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_16@xxxxxxxxxxxxxxxxxxxxxxxx 2 100.0 1.0G - -
/system.slice/condor.service/condor_local_condor_execute_slot1_3@xxxxxxxxxxxxxxxxxxxxxxxx 1 100.0 349.9M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_5@xxxxxxxxxxxxxxxxxxxxxxxx 1 100.0 111.1M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_17@xxxxxxxxxxxxxxxxxxxxxxxx 1 100.0 7.3G - -
/system.slice/condor.service/condor_local_condor_execute_slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx 1 100.0 116.0M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_15@xxxxxxxxxxxxxxxxxxxxxxxx 1 99.9 111.1M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_12@xxxxxxxxxxxxxxxxxxxxxxxx 1 99.9 814.5M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_6@xxxxxxxxxxxxxxxxxxxxxxxx 2 99.7 463.2M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_8@xxxxxxxxxxxxxxxxxxxxxxxx 1 99.6 153.5M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx 2 99.6 828.3M - -
/system.slice/condor.service/condor_local_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxx 1 99.2 158.8M - -
/system.slice/system-llldd\x2dDpushM.slice 12 2.3 370.6M - -
/user.slice 3 2.0 15.7M - -
/system.slice/boinc-client.service 13 1.1 3.4G - -
Note that the cgtop view of condor_local_condor_execute_slot1_{1..17} matches condor_status:
[root@ldas-grid ~]# condor_status node775.cluster.ldas.cit
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 1.000 179463 0+16:24:41
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 4096 0+10:21:37
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 2048 0+00:04:54
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 4096 0+06:18:20
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.270 2048 0+00:04:52
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 4096 0+04:36:19
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.270 4096 0+06:06:40
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Idle 0.200 2048 0+00:00:07
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 4096 0+05:21:26
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 2048 0+00:04:52
slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 4096 0+04:44:07
slot1_11@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 10.310 16000 0+14:51:16
slot1_12@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 4096 0+11:09:25
slot1_13@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 2048 0+00:04:52
slot1_14@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 4096 0+03:09:08
slot1_15@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.250 4096 0+03:09:08
slot1_16@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 2048 0+00:11:06
slot1_17@xxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Claimed Busy 1.260 11008 0+13:57:47
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 18 0 17 1 0 0 0
Total 18 0 17 1 0 0 0
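For comparison, the per-slot sub-groups that systemd-cgls flattens can be listed straight from the kernel's cgroup filesystem, e.g. (a sketch, assuming the stock cgroup-v1 mounts on EL7; substitute whichever controller condor is configured to use):

    # list the sub-cgroups condor created beneath its service cgroup
    ls -d /sys/fs/cgroup/cpu/system.slice/condor.service/*/
    # show the PIDs that actually live in one of the slot sub-groups
    cat /sys/fs/cgroup/cpu/system.slice/condor.service/condor_local_condor_execute_slot1_11*/tasks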
Greg, if you agree, please see if you can find the right place to file an RFE ticket for systemd-cgls to indent sub-groups.
Thanks.
--
Stuart Anderson anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson