Re: [HTCondor-users] Run Slurm as "guest" on a HTCondor pool?
- Date: Tue, 25 Nov 2025 14:13:03 +0000
- From: Michael DiDomenico <mdidomenico4@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Run Slurm as "guest" on a HTCondor pool?
On Tue, Nov 25, 2025 at 9:22âAM Steffen Grunewald
<steffen.grunewald@xxxxxxxxxx> wrote:
> In practice, HTCondor starts up with the machine, controlled by a systemd unit,
> while the node needs to be "drained" from HTCondor work (by setting START=False
> and IS_OWNER=True), possibly defragmented ... then the slurmd is fired up.
> (I could even have the slurmd running all the time, but compared to the condor_*
> daemons, it consumes more memory ... which is almost always a bottleneck... hm.)
Just my two cents: we do something similar. I have a
condor_startd_cron that runs every few seconds, checking the process
table on the node for a slurmstepd process. If it finds one, it evicts
Condor from the node; a timer in the cron script then waits for an
hour after the slurmstepd process disappears before switching the node
back into the unclaimed state.
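A minimal sketch of such a startd cron hook (my reconstruction, not the poster's actual script; the attribute name SlurmActive, the grace period, and the stamp-file path are all assumptions):

```shell
#!/bin/sh
# Startd cron hook: print a ClassAd attribute saying whether Slurm is
# (or recently was) active on this node. HTCondor runs the hook
# periodically and merges its output into the slot ads.
GRACE="${GRACE:-3600}"                  # seconds to wait after slurmstepd exits
STAMP="${STAMP:-/tmp/slurm_last_seen}"  # records the last slurmstepd sighting

slurm_ad() {
    now=$(date +%s)
    if pgrep -x slurmstepd >/dev/null 2>&1; then
        echo "$now" > "$STAMP"          # refresh the sighting timestamp
        echo "SlurmActive = True"
    else
        last=$(cat "$STAMP" 2>/dev/null || echo 0)
        if [ $((now - last)) -lt "$GRACE" ]; then
            echo "SlurmActive = True"   # still inside the grace window
        else
            echo "SlurmActive = False"
        fi
    fi
}

slurm_ad
```

On the HTCondor side this would be wired up with something like STARTD_CRON_JOBLIST = SLURMCHECK, STARTD_CRON_SLURMCHECK_EXECUTABLE pointing at the script, a short STARTD_CRON_SLURMCHECK_PERIOD, and the eviction expressed via START = ($(START)) && (SlurmActive =!= True); again a sketch, not the poster's configuration.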
It seems to work 99% of the time. I've seen issues where a GPU job
takes too long to evict, and if the Slurm job wants to bind to the GPU
immediately it'll fail; but otherwise we run both Slurm and Condor side
by side, as root, without issue. I looked at all the other options for
having one scheduler submit to the other, but I couldn't get any of
them to work, nor find much documentation.