
[HTCondor-users] condor_ssh_to_job/interactive jobs with apptainer



Hi everyone,


we're trying to enable Apptainer with our existing HTCondor install (+GPUs).


I'm stuck at a few points:

- condor_ssh_to_job leads to numerous warnings & errors with cgroups

- when running a job in apptainer, the GPUs are not found (nvidia-smi is not bind-mounted into the container either); running the same command manually as the "target user" works fine

- $HOME environment variable is (needlessly?) overwritten

- do I actually want to use apptainer-suid? It sounds like it would be more capable, e.g. when users have multiple groups, but I haven't seen any mention of it being required in the HTCondor documentation.

- are there any differences between container_image and manually setting +SingularityImage, other than container_image possibly using Docker when a docker:// URI is given? (See the sketch right below this list.)
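
To make the last point concrete, here is roughly what I am comparing; a minimal sketch, assuming a local .sif image (paths, executable, and resource requests are placeholders):

# Variant A: container universe with container_image
universe        = container
container_image = /path/to/pytorch_24.02-py3.sif

# Variant B: the "manual" route via the custom SingularityImage attribute
# universe          = vanilla
# +SingularityImage = "/path/to/pytorch_24.02-py3.sif"

executable   = run.sh
request_gpus = 1
queue

Both variants seem to end up as a singularity exec on the EP; the question is whether HTCondor treats them differently beyond the docker:// handling.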


Version info on EP:

$CondorVersion: 24.7.3 2025-04-22 BuildID: 803720 PackageID: 24.7.3-1+ubu22 GitSHA: e207c094 $

$CondorPlatform: X86_64-Ubuntu_22.04 $

apptainer version 1.3.4 (also tried with apptainer-suid installed in addition)


AP is HTCondor 23.x.


Singularity related config:

SINGULARITY_RUN_TEST_BEFORE_JOB = False

# SINGULARITY_IS_SETUID = True

# SINGULARITY_USE_PID_NAMESPACES = True


SINGULARITY_BIND_EXPR = strcat(ifThenElse(AcctGroup == "chair_valera", "/share_chairvalera ", ""), \

  ifThenElse(AcctGroup == "chair_ilg", "/share_chairilg ", ""), \

  ifThenElse(WantScratchMounted isnt Undefined && WantScratchMounted, "/scratch ",""), \

  ifThenElse(WantGPUHomeMounted isnt Undefined && WantGPUHomeMounted, "/home ", ""))


SINGULARITY_VERBOSITY=-v
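
For completeness, the WantScratchMounted / WantGPUHomeMounted attributes referenced above are our own custom job ad attributes; users are expected to set them at submit time roughly like this (just a sketch of the intended usage):

# opt into the extra bind mounts evaluated by SINGULARITY_BIND_EXPR on the EP
+WantScratchMounted = True
+WantGPUHomeMounted = True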



More info on the points above.

condor_ssh_to_job or just interactive jobs:

See log:

https://kingsx.cs.uni-saarland.de/index.php/s/yA5d7DqLsfjiSTK


The job seems to start successfully, and everything cgroup-related appears to work fine as well; however, when a condor_ssh_to_job request comes in, things start to break down:

> 05/21/25 14:30:21 ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy

> 05/21/25 14:30:21 Error setting cgroup memory limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory

> 05/21/25 14:30:21 Error setting cgroup swap limit of 107374182400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory

> 05/21/25 14:30:21 Error setting cgroup cpu weight of 1200 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory

> 05/21/25 14:30:21 Error enabling per-cgroup oom killing: 2 (No such file or directory)

> 05/21/25 14:30:21 cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted

Further down, the log is spammed with

> 05/21/25 14:30:26 ProcFamilyDirectCgroupV2::get_usage cannot open /sys/fs/cgroup/system.slice/htcondor/condor_raid_condor_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxx/sshd/memory.stat: 2 No such file or directory

> 05/21/25 14:30:26 error polling family usage


Can you give any information on what is happening here, or what I might have to fix in the setup?


Apptainer & GPUs:

When requesting GPUs with the apptainer job, I can see that --nv is being added to the command line, as expected.

However, I don't see the required files being mapped into the container (e.g. nvidia-smi), and thus the NVIDIA GPUs are not visible.

Looking at the stderr log with -v added to the singularity command line, I can see the following messages:

 

> VERBOSE: persistenced socket /var/run/nvidia-persistenced/socket not found

> WARNING: Could not find any nv files on this host!


However, running the command from the log manually works without any issue (while still producing the persistenced message):

> /usr/bin/singularity -v exec -S /tmp -S /var/tmp -W /raid/condor/lib/condor/execute/dir_173820 --pwd /raid/condor/lib/condor/execute/dir_173820 -B /raid/condor/lib/condor/execute/dir_173820 --nv -B /scratch -B /home -B /etc/OpenCL/vendors --home /raid/condor/lib/condor/execute/dir_173820 -C /home/jmeyer/condor_tutorial/pytorch_24.02-py3.sif bash


Any clues what is happening here?


(Note: setting the apptainer.conf option "nvidia-container-cli", which reminds me of how Docker handles this, doesn't seem to help at the moment either.)
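
As a diagnostic I was also considering forcing the missing binds by hand from the EP config, roughly like this (only a sketch, assuming SINGULARITY_EXTRA_ARGUMENTS is the right knob for passing extra singularity options and that the library path is correct for our hosts):

# EP config sketch: bind the NVIDIA user-space tools/libraries explicitly,
# in case --nv alone does not pick them up (paths are assumptions)
SINGULARITY_EXTRA_ARGUMENTS = "-B /usr/bin/nvidia-smi -B /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"

but that feels like it would only paper over whatever --nv is failing to detect.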


HOME environment variable:

The --home argument on the singularity command line seems to override my HOME environment variable as well. This is rather annoying when mounting the shared file system (/home) where my actual home directory is; we even automatically add HOME to getenv via a submit transform to ensure users get their HOME variable set as they would expect.

Can we maybe override this behavior?
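
For reference, what our submit-side handling effectively boils down to is something like the following (a sketch; whether any of it survives the --home override inside the container is exactly my question):

# import HOME from the user's submit-side environment
getenv = HOME

# or set it explicitly (equivalent in effect for our setup)
# environment = "HOME=$ENV(HOME)"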


Appreciate any help!

Thanks,

- Joachim Meyer


--

Joachim Meyer

HPC Coordination & Support


Universität des Saarlandes

FR Informatik | HPC


Postal address: P.O. Box 15 11 50 | 66041 Saarbrücken


Visiting address: Campus E1 3 | Room 4.03

66123 Saarbrücken


T: +49 681 302-57522

jmeyer@xxxxxxxxxxxxxxxxxx

www.uni-saarland.de