Hey,
that seemed like an interesting suggestion :)
I have now also added (test timeouts, obviously):
job_max_vacate_time = 120
kill_sig = SIGTERM
kill_sig_timeout = 120
want_graceful_removal = true
But this also doesn't seem to fix it... :/
My understanding is that MachineMaxVacateTime (in conjunction with JobMaxVacateTime) should define the interval between the SIGTERM and the SIGKILL.
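For reference, my current understanding of where the two knobs live (the machine-side value caps the job-side value - just a sketch, untested):

```
# EP (startd) configuration - the upper bound the machine grants
MachineMaxVacateTime = 10 * 60

# job submit file - what the job asks for (capped by the machine value)
job_max_vacate_time = 120
kill_sig = SIGTERM
```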
I'm open to trying any other suggestions!
Best,
- Joachim
Hi,
I think you need to put a signal wish into the job classad:
kill_sig = < ... > (e.g. SIGTSTP)
A job with this setup will receive the signal you configured in order to preempt, followed by a SIGKILL - I fear the time between the two signals is short, but you will need to test that.
Also, I have no idea whether you can alter the time between the two signals - the knob to do so may be missing ...
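For example, in the submit file (the executable name is just a placeholder):

```
# submit-file sketch - ask for a specific signal on preemption
executable = my_job.sh
kill_sig   = SIGTSTP
queue
```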
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: "Joachim Meyer" <jmeyer@xxxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 25 March 2026 16:35:52
Subject: Re: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"
Hi,
thanks for the response!
I did add "(time() - Target.EnteredCurrentStatus) >= 3000" as a delay of 50 min, so already-running jobs can run for 50 minutes after a higher-priority job comes around.
However, the running job is not informed about its situation, i.e. it has not received a SIGTERM or anything yet, and thus doesn't know it's about to be killed.
I'd love to give those to-be-killed jobs the 10 min vacate time to potentially write out a checkpoint or so - at least that's my understanding of what Max(Machine)VacateTime is actually intended for: first send a SIGTERM, wait for the vacate time, and then send a SIGKILL.
But that only happens in the Preempting/Vacating activity... in our case, the starter jumps directly to Preempting/Killing, and I don't quite understand why it skips the Vacating activity.
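For illustration, the kind of SIGTERM handler I'd want our jobs to run during that vacate window - a minimal bash sketch, with the checkpoint file name and timings made up:

```shell
#!/usr/bin/env bash
# Sketch of a job payload that uses the vacate window: trap SIGTERM,
# write a checkpoint, and exit cleanly before the SIGKILL arrives.
worker() {
    # on SIGTERM, save minimal state and exit 0
    trap 'echo "step=$STEP" > checkpoint.dat; exit 0' TERM
    STEP=0
    for _ in 1 2 3 4 5; do
        STEP=$((STEP + 1))
        sleep 1
    done
}

worker &             # run the "job" in the background
sleep 1
kill -TERM $!        # what the startd would send at the start of Vacating
wait $! 2>/dev/null  # worker writes checkpoint.dat and exits 0
```

Of course the handler only gets its chance if a SIGTERM is actually sent before the SIGKILL.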
Not sure if I'm missing something to make the vacate time take effect?
Cheers,
- Joachim

On 25.03.26 at 15:04, Beyer, Christoph wrote:
Hi,
I think you should rather fold the delay into the PREEMPTION_REQUIREMENTS, because as soon as the requirement is fulfilled it will preempt - we do not use preemption actively, hence this is just a wild guess ...
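Roughly something like this (untested, and the number is made up):

```
# negotiator config sketch - fold the delay into the requirement itself
PREEMPTION_REQUIREMENTS = Target.AcctGroup =?= "prio" && \
    (time() - Target.EnteredCurrentStatus) >= 3600
```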
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: "Joachim Meyer" <jmeyer@xxxxxxxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 25 March 2026 14:55:56
Subject: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"
Hi all,
we are trying to implement preemption for a small subset of our nodes.
All our nodes have a single partitionable slot - most with GPUs (we actually only care about GPU nodes for preemption).
We are on 24.12 across all EPs, APs, and the negotiator.
The setup generally seems to work, with:
For the negotiator:
ALLOW_PSLOT_PREEMPTION = True
PREEMPTION_REQUIREMENTS = Target.AcctGroup =?= "prio" && !regexp("prio\..+@", My.AccountingGroup) && (time() - Target.EnteredCurrentStatus) >= 3000 && My.Machine =?= "the_special_machine"
On the EP:
RANK = 0
MachineMaxVacateTime = 60*10
MAXJOBRETIREMENTTIME = 0
START = ... && (Target.MaxJobRetirementTime == 0 || AcctGroup =?= "prio")
So, we basically get a guaranteed access delay of 1 hour for the group "prio" to their machine.
That's at least the idea.
The 3000 s maximum wait in the queue is working as I intended it to; however, the dynamic slots get killed immediately once the preemption decision has been made...
No vacate time at all - even though it is set to 60*10.
When condor_hold-ing jobs, they do get a SIGTERM first and then a SIGKILL after a timeout - from what I understand, I would expect the behaviour to be similar here..?
We do actively set MAXJOBRETIREMENTTIME to 0, since to my understanding we would otherwise not get a guaranteed start after 1 hour (a new job might always snuggle in, with the pslot preemption).
But I at least want the jobs to have their vacate time before going from SIGTERM to SIGKILL...
The StartLog from the EP shows:
03/25/26 12:04:10 slot1: Schedd addr = <10.143.248.61:9618?addrs=10.143.248.61-9618&alias=conduit2&noUDP&sock=schedd_2302500_1092>
03/25/26 12:04:10 slot1: Alive interval = 300
03/25/26 12:04:10 slot1: Schedd sending 5 preempting claims.
03/25/26 12:04:10 slot1_2: Canceled ClaimLease timer (1859)
03/25/26 12:04:10 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 queuing timer
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: Canceled ClaimLease timer (1876)
03/25/26 12:04:10 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: ubind DevIds for slot1.3 unbind GPU-937c3d9c 1 OK
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: Canceled ClaimLease timer (1878)
03/25/26 12:04:10 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: ubind DevIds for slot1.4 unbind GPU-de2e358d 1 OK
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
Can anyone give advice on whether it is possible with pslot preemption to implement a policy that gives guaranteed maximum access delays for the "prio" group / the owner of a machine?
And more specifically, is there anything I am missing to make jobs get their vacate time granted when being preempted?
Thanks!
- Joachim Meyer
--
Joachim Meyer
HPC Coordination & Support
Universität des Saarlandes FR Informatik | HPC
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visiting address: Campus E1 3 | Raum 4.03 | 66123 Saarbrücken
T: +49 681 302-57522 | jmeyer@xxxxxxxxxxxxxxxxxx | www.uni-saarland.de
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/