
[HTCondor-users] Advertised Memory of an execute point running in Apptainer.



Hi all,

We are trying to run an HTCondor execute point (master, startd) in an Apptainer container inside a SLURM job. HTCondor starts and is able to run jobs. However, the advertised memory is always the memory of the host system, even when I limit the memory via Apptainer (apptainer run --memory ...). I tried setting the "Memory" ClassAd attribute in the config of the execute point, but it had no effect. I also tried it with Memory = "foo": the execute point still advertised the memory of the host system and accepted jobs, but starting the jobs failed because HTCondor tried to evaluate Memory = "foo" and complained that it is not an integer.

Do I need to set something?
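For reference, this is roughly what I would expect to work in the execute point's configuration, assuming the MEMORY configuration macro (total memory in MiB) is honored; the value is just an example matching the 500 GiB cgroup limit shown below:

```
# condor_config snippet on the execute point (sketch, not verified)
# MEMORY = total memory HTCondor should advertise, in MiB.
# 512000 MiB corresponds to the 536870912000-byte cgroup limit below.
MEMORY = 512000
```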
The Slurm worker node runs Rocky Linux 9.5. I tried HTCondor versions 23.10.18 and 24.0.5. The Slurm cluster uses the cgroup v2 plugin in slurmstepd.
Could it be a problem with how the cgroups are set up?
The memory limit of the job itself is set to what the job requests:

cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_3679/memory.max
536870912000

But the HTCondor processes run in a sub-cgroup where the limit is set to "max":

cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_3679/step_batch/user/task_0/memory.max
max
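To compare what the startd detects with what it advertises, I query it like this (a sketch; assuming condor_status can reach the startd, where DetectedMemory is the probed machine value and Memory the advertised slot value):

```
condor_status -af:h Name Memory DetectedMemory
```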

Best regards,

Matthias