Dear HTCondor devs,
since I don't have access to the new (very well set-up!) JIRA bug tracker, let me use the mailing list to report an issue we observe from good old 8.8 through to the 9.0.16 series:
When starting a Singularity container job and attaching to it (or starting an interactive job), the process tree looks as follows:
condor 4703 \_ condor_startd
condor 3396708 \_ condor_starter -f -local-name slot_type_1 -a slot1_1 submitnode.physik.uni-bonn.de
someuser 3396952 \_ Singularity runtime parent
someuser 3396965 | \_ sinit
someuser 3396988 | \_ /bin/sh -c sleep 180 && while test -d ${_CONDOR_SCRATCH_DIR}/.condor_ssh_to_job_1; do /bin/slee
someuser 3396990 | \_ sleep 180
someuser 3396997 \_ sshd: someuser [priv]
someuser 3396999 | \_ sshd: someuser@pts/0
someuser 3397000 | \_ /usr/bin/condor_docker_enter
someuser 3397020 \_ /usr/bin/nsenter -t 3396988 -S 67803 -G 513 -m -i -p -r -w
someuser 3397021 \_ /bin/sh -l -i
However, the processes that later attached via nsenter do not end up in the same cgroup:
# cat /sys/fs/cgroup/memory/htcondor/condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/cgroup.procs
3396952
3396965
3396988
3396990
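For completeness, the same mismatch can also be checked per process via /proc (PIDs as in the tree above), e.g.:
# grep memory /proc/3396988/cgroup   <- job payload started by the starter, lists the per-slot HTCondor cgroup
# grep memory /proc/3397021/cgroup   <- shell attached via nsenter, the per-slot cgroup is not listed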
Consequently, limit enforcement (CPUs, memory) takes place neither for interactive jobs nor for processes spawned via "condor_ssh_to_job".
Ideas for good workarounds (or of course a fix) welcome ;-).
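For illustration, a very rough sketch of one possible direction (untested; cgroup v1 layout as above, and the slot cgroup path and the PID/UID/GID values are only placeholders taken from this example): a privileged wrapper around the nsenter step could first move itself into the job's cgroup, so that the attached shell and all of its children inherit it:

# needs write access to the cgroup hierarchy, so this would have to happen
# on the root-owned side (e.g. around condor_docker_enter), not as the job user
SLOT_CGROUP=/sys/fs/cgroup/memory/htcondor/condor_pool_condor_slot1_1@<execute host>
echo $$ > "${SLOT_CGROUP}/cgroup.procs"
# (the same would have to be repeated for the other controllers, cpu, cpuacct, ...)
exec /usr/bin/nsenter -t 3396988 -S 67803 -G 513 -m -i -p -r -w /bin/sh -l -i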
I'll sadly not make it to HTCondor Europe this year, since it collides with the start of our winter term (technical support for lectures and teaching duties),
but I wish all of you a good time in Italy and hope to see you again in person in one of the coming years!
Cheers from Bonn,
Oliver
--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax: +49 228 73 7869
--