Hi all,
we are trying to implement preemption for a small subset of our
nodes.
All our nodes have a single partitionable slot - most with GPUs
(we actually only care about GPU nodes for preemption).
We are on 24.12 across all EPs, APs, and the negotiator.
The setup generally seems to work with the following configuration:
For the negotiator:
ALLOW_PSLOT_PREEMPTION = True
PREEMPTION_REQUIREMENTS = Target.AcctGroup =?= "prio" && !regexp("prio\..+@", My.AccountingGroup) && (time() - Target.EnteredCurrentStatus) >= 3000 && My.Machine =?= "the_special_machine"
On the EP:
RANK = 0
MachineMaxVacateTime = 60*10
MAXJOBRETIREMENTTIME = 0
START = ... && (Target.MaxJobRetirementTime == 0 ||
AcctGroup =?= "prio")
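Read together, the negotiator expression is meant to say the following (a hedged Python paraphrase of the ClassAd logic; the sample-ad dicts and function name are made up for illustration):

```python
import re
import time

def preemption_requirements(my, target, now=None):
    """Rough paraphrase of our PREEMPTION_REQUIREMENTS ClassAd:
    a candidate job in group "prio" that has been idle for at least
    3000 s may preempt a claim whose owner is not already in "prio",
    and only on the one special machine."""
    now = time.time() if now is None else now
    return (
        target["AcctGroup"] == "prio"
        and not re.search(r"prio\..+@", my["AccountingGroup"])
        and (now - target["EnteredCurrentStatus"]) >= 3000
        and my["Machine"] == "the_special_machine"
    )

# Made-up sample ads: a non-prio claim vs. a prio job idle for 3600 s.
running = {"AccountingGroup": "other.user@cluster", "Machine": "the_special_machine"}
candidate = {"AcctGroup": "prio", "EnteredCurrentStatus": 0}
print(preemption_requirements(running, candidate, now=3600))  # True
```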
So we basically get a guaranteed access delay of 1 hour for the
group "prio" to their machine - at least, that's the idea.
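For reference, the intended worst case adds up like this (a rough sketch using only the numbers from the config above; negotiation-cycle latency is ignored):

```python
# Rough arithmetic for the intended worst-case access delay,
# using the numbers from the config above.
queue_age_before_preemption = 3000  # seconds, from PREEMPTION_REQUIREMENTS
vacate_time = 60 * 10               # seconds, MachineMaxVacateTime on the EP

worst_case_delay = queue_age_before_preemption + vacate_time
print(worst_case_delay)  # 3600 seconds, i.e. 1 hour
```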
The 3000 s maximum queue wait is working as I intended it to;
however, the dynamic slots get killed immediately once the
preemption decision has been made...
No vacate time at all - even though MachineMaxVacateTime is set to 60*10.
When condor_hold-ing jobs, they do get a SIGTERM first and then a
SIGKILL after a timeout - I would expect preemption to behave
similarly, from what I understand..?
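That grace period is exactly what our jobs rely on - a minimal sketch of the pattern (assuming a payload that checkpoints on SIGTERM; the handler here is made up for illustration):

```python
import os
import signal

cleaned_up = []

def on_sigterm(signum, frame):
    # A real job would checkpoint / flush results here; a later
    # SIGKILL cannot be caught, so this window is the only chance.
    cleaned_up.append(True)

signal.signal(signal.SIGTERM, on_sigterm)
os.kill(os.getpid(), signal.SIGTERM)  # simulate the graceful signal
print("handler ran:", cleaned_up == [True])
```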
We deliberately set MAXJOBRETIREMENTTIME to 0 because, to my
understanding, we would otherwise not get a guaranteed start after
1 hour (a new job might always sneak in ahead, even with pslot
preemption).
But I would at least like the jobs to get their vacate time
between the SIGTERM and the SIGKILL...
Can anyone give advice on whether it is possible, with pslot
preemption, to implement a policy that gives guaranteed maximum
access delays for the "prio" group / the owner of a machine?

The StartLog from the EP shows:
03/25/26 12:04:10 slot1: Schedd addr = <10.143.248.61:9618?addrs=10.143.248.61-9618&alias=conduit2&noUDP&sock=schedd_2302500_1092>
03/25/26 12:04:10 slot1: Alive interval = 300
03/25/26 12:04:10 slot1: Schedd sending 5 preempting claims.
03/25/26 12:04:10 slot1_2: Canceled ClaimLease timer (1859)
03/25/26 12:04:10 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 queuing timer
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: Canceled ClaimLease timer (1876)
03/25/26 12:04:10 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: ubind DevIds for slot1.3 unbind GPU-937c3d9c 1 OK
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: Canceled ClaimLease timer (1878)
03/25/26 12:04:10 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: ubind DevIds for slot1.4 unbind GPU-de2e358d 1 OK
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
Thanks!
- Joachim Meyer
Joachim Meyer
HPC-Koordination & Support
Universität des Saarlandes, FR Informatik | HPC
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Room 4.03, 66123 Saarbrücken
T: +49 681 302-57522 jmeyer@xxxxxxxxxxxxxxxxxx
www.uni-saarland.de