Hi everyone,
we're trying to enable Apptainer with our existing HTCondor install (+GPUs).
I'm stuck on a few points:
- condor_ssh_to_job leads to numerous cgroup-related warnings and errors
- when running a job in Apptainer, the GPUs are not found (nvidia-smi is not mounted either); the same command works if I just run it manually as the "target user"
- the $HOME environment variable is (needlessly?) overwritten
- do I actually want to use apptainer-suid? It sounds like it would be more capable, e.g. when users are in multiple groups, but I haven't seen any mention of it being required in the HTCondor documentation.
- are there any differences between container_image and the manual +SingularityImage route, other than container_image possibly using Docker when a docker:// URI is given? (See the sketch right below this list for what I mean.)
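To make that last point concrete, these are the two submission variants I am comparing (simplified sketch; the executable name, request_gpus value and the docker:// image are just placeholders, the SIF path is the one from the command line further below):

# Variant A: container universe with container_image
universe        = container
container_image = docker://nvcr.io/nvidia/pytorch:24.02-py3
executable      = run.sh
request_gpus    = 1
queue

# Variant B: vanilla universe with the "manual" custom attribute
universe          = vanilla
+SingularityImage = "/home/jmeyer/condor_tutorial/pytorch_24.02-py3.sif"
executable        = run.sh
request_gpus      = 1
queue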
Version info on EP:
$CondorVersion: 24.7.3 2025-04-22 BuildID: 803720 PackageID: 24.7.3-1+ubu22 GitSHA: e207c094 $
$CondorPlatform: X86_64-Ubuntu_22.04 $
apptainer version 1.3.4 (tried with additionally installing apptainer-suid)
AP is HTCondor 23.x.
Singularity-related config:
SINGULARITY_RUN_TEST_BEFORE_JOB = False
# SINGULARITY_IS_SETUID = True
# SINGULARITY_USE_PID_NAMESPACES = True
SINGULARITY_BIND_EXPR = strcat(ifThenElse(AcctGroup == "chair_valera", "/share_chairvalera ", ""), \
ifThenElse(AcctGroup == "chair_ilg", "/share_chairilg ", ""), \
ifThenElse(WantScratchMounted isnt Undefined && WantScratchMounted, "/scratch ",""), \
ifThenElse(WantGPUHomeMounted isnt Undefined && WantGPUHomeMounted, "/home ", ""))
SINGULARITY_VERBOSITY=-v
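For completeness, the effective values can be double-checked on the EP like this (sketch):

# all knobs whose name matches SINGULARITY, plus the two I care about most
condor_config_val -dump SINGULARITY
condor_config_val SINGULARITY_BIND_EXPR
# path of the apptainer/singularity binary HTCondor will invoke
condor_config_val SINGULARITY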
More info on the points above.
condor_ssh_to_job or just interactive jobs:
See log:
https://kingsx.cs.uni-saarland.de/index.php/s/yA5d7DqLsfjiSTK
The job starts successfully and everything cgroup-related seems to work fine as well; however, when a condor_ssh_to_job request comes in, things start to break down:
> 05/21/25 14:30:21 ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
> 05/21/25 14:30:21 Error setting cgroup memory limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup swap limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error setting cgroup cpu weight of 1200 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
> 05/21/25 14:30:21 Error enabling per-cgroup oom killing: 2 (No such file or directory)
> 05/21/25 14:30:21 cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted
Further down, the log is spammed with:
> 05/21/25 14:30:26 ProcFamilyDirectCgroupV2::get_usage cannot open /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd/memory.stat: 2 No such file or directory
> 05/21/25 14:30:26 error polling family usage
Can you give me any pointers on what is happening here and what I might have to fix in the setup?
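In case it is relevant, these are the delegation checks I can run on the EP (sketch; the cgroup paths are taken from the log above, and the unit name condor.service is an assumption about our setup):

# which controllers exist and which are delegated below system.slice
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
systemctl show condor.service -p Delegate

# does the slot cgroup exist, and does the "sshd" child ever show up in it?
ls /sys/fs/cgroup/system.slice/htcondor/
ls /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@*/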
Apptainer & GPUs
When requesting GPUs for the Apptainer job, I can see that --nv is added to the command line, as expected.
However, the required files (e.g. nvidia-smi) are not mapped into the container, and thus the NVIDIA GPUs are not visible.
Looking at the stderr log with -v added to the singularity command line, I can see the following messages:
> VERBOSE: persistenced socket /var/run/nvidia-persistenced/socket not found
> WARNING: Could not find any nv files on this host!
However, running the command from the log manually works without any issue (while still producing the persistenced message):
> /usr/bin/singularity -v exec -S /tmp -S /var/tmp -W /raid/condor/lib/condor/execute/dir_173820 --pwd /raid/condor/lib/condor/execute/dir_173820 -B /raid/condor/lib/condor/execute/dir_173820 --nv -B /scratch -B /home -B /etc/OpenCL/vendors --home /raid/condor/lib/condor/execute/dir_173820 -C /home/jmeyer/condor_tutorial/pytorch_24.02-py3.sif bash
Any clues as to what is happening here?
(Note: setting the apptainer.conf option "nvidia-container-cli", which reminds me of how Docker handles this, doesn't seem to help at the moment either.)
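Since the manual invocation works for my own user, my current guess is that the starter runs singularity with a different environment (PATH / ldconfig visibility of the NVIDIA files), if I understand the --nv discovery correctly. This is the comparison I have in mind (sketch; the "condor" service account name is a guess):

# what --nv should be able to discover in my own environment
which nvidia-smi
ldconfig -p | grep -ci nvidia
ls /dev/nvidia*

# the same checks in the account the starter presumably uses
sudo -u condor bash -c 'echo $PATH; which nvidia-smi; ldconfig -p | grep -ci nvidia'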
HOME environment variable:
The --home argument on the singularity command line also overrides my HOME environment variable. This is rather annoying when the shared file system (/home), where my actual home directory lives, is mounted: we even automatically add HOME to getenv via a submit transform so that users get the HOME value they expect.
Can we maybe override this behavior?
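For context, this is roughly what our submit side produces today, versus what ends up happening (simplified sketch; the home path is just an example):

# what the job asks for on the AP ...
getenv      = HOME
# ... or, equivalently for this discussion:
environment = "HOME=/home/jmeyer"

# ... but the starter then adds
#   --home /raid/condor/lib/condor/execute/dir_173820
# to the singularity command line (see the log above), so inside the
# container HOME points at the scratch directory instead.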
Appreciate any help!
Thanks,
- Joachim Meyer
--
Joachim Meyer
HPC Coordination & Support
Universität des Saarlandes
FR Informatik | HPC
Postal address: P.O. Box 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Room 4.03
66123 Saarbrücken
T: +49 681 302-57522
jmeyer@xxxxxxxxxxxxxxxxxx
www.uni-saarland.de