
Re: [HTCondor-users] Bug: cgroup limits not enforced with Singularity containers and condor_ssh_to_job / interctive jobs



Hi Greg,

On 04.10.22 at 23:47, Oliver Freyermuth wrote:
One issue which has crept up on me again is a race in interactive container jobs in the starter here:
https://github.com/htcondor/htcondor/blob/5e1c909f59372e029e3c6019f57c4688737e3b2f/src/condor_starter.V6.1/os_proc.cpp#L1190-L1195
Our users sometimes manage to hit the "hope for the best" branch there (i.e. using Singularity itself as the PID to attach to) and then end up in the wrong namespaces (e.g. the wrong mount namespace)
if the filesystem on which the container is located is slow and "condor_submit -interactive" is quick to execute "condor_ssh_to_job".

I'll see if I can squeeze in some development time on this. Probably the best-effort approach is to delay and retry in case there is no child of the Singularity process (yet);
if I manage, I can contribute a PR :-).

I finally had some time to look into this today. Sadly, I found three possible races, of which I could observe (2) and (3) in reality with a high "success" rate on a busy execute node
when submitting an interactive job with SINGULARITY_JOB=True (using the latest HTCondor "main" right from git):

1. The starter may fall back to letting nsenter attach to the namespaces of the main Singularity/Apptainer process
   (the "hope for the best" branch in the code linked above). I did not see this one in real life, but if this happens, we would end up in the wrong namespaces (i.e. on the host).

2. Attaching to sinit / appinit (i.e. "PID 1 in the container") before it has concluded setup of all mounts.
   This usually seems to end up in the correct mount namespace, but the session still sees parts of the host filesystem while connecting with condor_ssh_to_job (e.g. /etc/profile from the host),
   which causes an environment "mashup" for interactive shells.

3. Attaching to a child of sinit / appinit (i.e. a grandchild of Singularity / Apptainer) which vanishes while attaching,
   i.e. before the actual payload is started. This ends up in partially wrong namespaces; it seems Singularity / Apptainer does some setup work with children of the init process
   before namespace setup is complete and the payload is called.
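
To check which of these cases a given attach attempt has hit, one can compare the namespace identifiers of the attached shell with those of the payload. The following is only a minimal diagnostic sketch in Python, not part of HTCondor: the function names are made up for illustration, it assumes a Linux /proc layout with one symlink per namespace under /proc/<pid>/ns, and reading these links for other users' processes needs sufficient privileges.

```python
import os


def namespace_ids(pid, proc_root="/proc"):
    """Map each namespace name (e.g. 'mnt', 'pid') to its identifier string."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symlink whose target looks like 'mnt:[4026531840]'.
        ids[name] = os.readlink(os.path.join(ns_dir, name))
    return ids


def differing_namespaces(pid_a, pid_b, proc_root="/proc"):
    """Return the sorted names of namespaces in which the two PIDs differ."""
    a = namespace_ids(pid_a, proc_root)
    b = namespace_ids(pid_b, proc_root)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Calling differing_namespaces(shell_pid, payload_pid) on the execute node should return an empty list if the attach ended up in the same namespaces as the payload. Note that this would not reveal case (2), where the mount namespace is the right one but its mounts are not yet complete.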

Test setup: An execute node with high load (stress-ng --hdd 8 --cpu 32) and slow / congested network, which we also see in real life.


I did originally want to contribute a PR, but now I'm not sure which solution to choose for this: (1) would be easy to fix (retry if there are no children),
(3) could be worked around assuming that recent Singularity / Apptainer always uses a PID namespace and an init process (i.e. "ignore grandchildren"),
but that could cause backwards incompatibility and older versions of nsenter + kernel may choose the wrong PID namespace [0],
and finally, for (2) I'm out of good ideas.
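
The "retry if there are no children" idea for (1) could look roughly like the sketch below. It is written in Python for brevity, whereas the starter itself is C++, so it only illustrates the logic and is not a proposed patch. The helper names, the timeout and the polling interval are assumptions chosen for the example, and it addresses only race (1), not (2) or (3).

```python
import os
import time


def child_pids(pid, proc_root="/proc"):
    """Return the sorted PIDs whose parent is `pid`, by scanning /proc/*/stat."""
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name may contain spaces or parentheses, so split after
        # the last ')'; the remaining fields start with state, then ppid.
        fields = stat.rsplit(")", 1)[1].split()
        if int(fields[1]) == pid:
            children.append(int(entry))
    return sorted(children)


def wait_for_child(pid, timeout=10.0, interval=0.2,
                   list_children=child_pids, sleep=time.sleep):
    """Poll until `pid` has a child; return its PID, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        children = list_children(pid)
        if children:
            return children[0]
        if time.monotonic() >= deadline:
            return None  # caller decides whether to fail or fall back
        sleep(interval)
```

Returning None on timeout leaves it to the caller to decide whether to refuse the connection or to fall back to the current "hope for the best" behaviour.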

Do you have any good ideas on how to tackle this? Maybe there's an easy way out (catching all these races in one go) that I am missing?


Cheers and all the best from Bonn,
	Oliver

[0] https://github.com/util-linux/util-linux/commit/0d5260b66c5581c8a5855a5f49e298e48e8baf82


Cheers from Bonn (and indeed hope I can join in person sometime soon),
	Oliver

[0] https://lists.cs.wisc.edu/archive/htcondor-users/2021-August/msg00132.shtml


We will miss you this year in Italy, but hope that you can join us in person soon!


-greg



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--