Re: [HTCondor-users] Run Slurm as "guest" on a HTCondor pool?
- Date: Tue, 25 Nov 2025 10:21:43 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Run Slurm as "guest" on a HTCondor pool?
Thanks Greg, for your quick response - which is a bit disillusioning, I admit,
but I'm not willing to give up yet ;)
On Mon, 2025-11-24 at 11:28:40 -0600, HTCondor Users Mailinglist wrote:
> There are several different ways to share resources between slurm and
> htcondor, working with disparate systems is one of the challenges of the
> distributed world.
Tell me about it - this setup is already inhomogeneous, and it will become
even more so with the next extension :)
> Looks to me like you are suggesting what we would call
> "gliding in" a slurm over HTCondor.
Call it that, indeed. But nobody seems to run HPC on top of HTC, while the
other way around is rather common.
> I'm not aware of anyone doing this.
I haven't found any hint on the 'net yet, which could have any number of reasons :)
> My
> understanding is that slurm really wants to run as root. We in HTCondor try
> to prevent jobs from running as root, even when running in docker
> containers.
Now that is a weighty argument. Since I'm more into apptainer/singularity than
Docker, I wonder whether there might be a way around it nonetheless (and I've
recently seen user software run as root inside a Docker container - not in an
HTCondor context, though - so it must be possible somehow; time to ask the
responsible programmer to share his secrets!).
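(Side note, untested: apptainer's --fakeroot option might already get close -
it maps the calling user to uid 0 inside the container via user namespaces,
without granting real root on the host. Something like

    # "slurm-node.sif" is a made-up image name; --fakeroot maps my
    # unprivileged uid to 0 inside the container via user namespaces:
    apptainer exec --fakeroot slurm-node.sif whoami   # should print "root"

Whether slurmd would be satisfied with a user-namespace root is of course a
different question.)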
> The more common way is to run HTCondor and slurm "next to" each
> other,
This is my current approach ...
> where perhaps both have root and are started by systemd/init, and one
> disables the other when it has work to do.
In practice, HTCondor starts up with the machine, controlled by a systemd unit,
while the node needs to be "drained" of HTCondor work (by setting START=False
and IS_OWNER=True), possibly defragmented ... and then slurmd is fired up.
(I could even keep slurmd running all the time, but compared to the condor_*
daemons it consumes more memory ... which is almost always a bottleneck ... hm.)
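For reference, the hand-off condenses to roughly the following - an untested
sketch, not the actual scripts, and it assumes ENABLE_RUNTIME_CONFIG is set so
that -rset is allowed:

    # Sketch of the HTCondor -> Slurm hand-off, run on the node itself.
    # Stop matching new HTCondor jobs:
    condor_config_val -startd -rset "START = False"
    condor_config_val -startd -rset "IS_OWNER = True"
    condor_reconfig -daemon startd

    # Wait until no slot on this node is busy any more:
    while condor_status "$(hostname -f)" -af Activity | grep -q Busy; do
        sleep 60
    done

    # Hand the cores over to Slurm:
    systemctl start slurmd
    scontrol update NodeName="$(hostname -s)" State=RESUME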
I've come up with a semi-automatic mechanism to "convert" nodes back and
forth between the two schedulers, triggered by a certain pattern in the reason
given for the state change. That reason currently has to be set by hand.
(Sometimes I wish HTCondor had such a central "memory" of which nodes are to
be used.)
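For the flavour of it, a stripped-down sketch of the idea (not the actual
scripts; the "to-htcondor" marker string is made up for this mail):

    # Periodically convert drained Slurm nodes back to HTCondor when the
    # drain reason carries the magic marker, set beforehand via
    #   scontrol update NodeName=<node> State=DRAIN Reason="to-htcondor"
    for node in $(sinfo -N -h -t drain -o "%N"); do
        reason=$(scontrol show node "$node" | sed -n 's/.*Reason=\([^ ]*\).*/\1/p')
        case "$reason" in
            to-htcondor*)
                ssh "$node" 'systemctl stop slurmd &&
                             condor_config_val -startd -rset "START = True" &&
                             condor_config_val -startd -rset "IS_OWNER = False" &&
                             condor_reconfig -daemon startd' ;;
        esac
    done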
> I believe we have seen users do
> this with slurm prescripts and postscripts.
You're right, that would be a convenient place to do this! Investigating...
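If it works the way I hope, it should be little more than a pair of entries in
slurm.conf plus two tiny scripts - roughly like this (paths invented, and the
refcounting needed when several Slurm jobs share one node glossed over):

    # slurm.conf (sketch): scripts run as root on each allocated node
    # at job start / job end:
    Prolog=/etc/slurm/pause-condor.sh
    Epilog=/etc/slurm/resume-condor.sh

    #!/bin/sh
    # pause-condor.sh: peacefully stop the startd while Slurm has work
    condor_off -startd -peaceful

    #!/bin/sh
    # resume-condor.sh: bring the startd back once the Slurm job is gone
    condor_on -startd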
On the HTCondor side, perhaps STARTD_CRON tasks could do something similar,
although that's basically an alternative to Hibernation (since those hand-overs
should only happen while a node is Unclaimed) ... there are too many knobs and
states!
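For the archives, the STARTD_CRON variant I have in mind would look roughly
like this (attribute name, script path, and the flag file are all invented
for this mail):

    # Publish a SlurmWanted attribute into the slot ads every few minutes,
    # and only START HTCondor jobs while it isn't set:
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) SLURMWATCH
    STARTD_CRON_SLURMWATCH_EXECUTABLE = /usr/local/libexec/slurmwatch.sh
    STARTD_CRON_SLURMWATCH_PERIOD = 5m
    STARTD_CRON_SLURMWATCH_MODE = Periodic
    START = ($(START)) && (SlurmWanted =!= True)

where slurmwatch.sh just prints a ClassAd attribute to stdout:

    #!/bin/sh
    # Sketch: derive "does Slurm want this node?" from a local flag file.
    if [ -e /etc/condor/slurm-wanted ]; then
        echo "SlurmWanted = True"
    else
        echo "SlurmWanted = False"
    fi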
> Let us know what you learn and how you decide to go!
Learning is slow these days, but I'll certainly share what I find ;)
Thanks,
Steffen