Hi,
I guess I'm a little defeated given these comments:
https://github.com/htcondor/htcondor/blob/634592ae74b9c39137a7df46876eb94b8cb8e22f/src/condor_startd.V6/command.cpp#L1088
        // TODO We really should follow the normal preemption
        // process, giving the preempted starter and schedd a
        // chance to kill the job in an orderly fashion.
        // TODO Should we call retire_claim() to go through
        // vacating_act instead of straight to killing_act?
Seems like there indeed is a reason for still calling that feature experimental :)
So, I guess we'll just have to tell the users who so desperately
want to backfill on that node that they have to write checkpoints
regularly (as they should anyway..)
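For those users, a minimal self-checkpointing sketch (plain Python; the file name and the state dict are made-up examples, not anything HTCondor prescribes) could look like:

```python
import json
import signal
import sys

state = {"step": 0}  # whatever the job needs in order to resume (illustrative)

def save_checkpoint(path, job_state):
    """Persist the job's state so a restarted run can pick up where it left off."""
    with open(path, "w") as f:
        json.dump(job_state, f)

def on_sigterm(signum, frame):
    # When the starter sends SIGTERM, write a checkpoint before the
    # follow-up SIGKILL arrives, then exit.
    save_checkpoint("checkpoint.json", state)
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
```

On restart, the job would check for checkpoint.json and resume from it.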
Thanks for any suggestions you had!
Best,
- Joachim
Hi Cole,
Thanks for chiming in!
I didn't explicitly set them earlier, as the defaults sounded like what I wanted..
I did explicitly set them to the following for testing purposes:
WANT_VACATE = True
KILL = False
The behavior is sadly still the same.
Best,
- Joachim
On 25.03.26 at 20:28, Cole Bollig via HTCondor-users wrote:
Hi Joachim,
What are WANT_VACATE and KILL set to on the EP?
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Wednesday, March 25, 2026 12:40 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"

Agreed - MachineMaxVacateTime sounds like exactly what you want ....
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: Joachim <jmeyer@xxxxxxxxxxxxxxxxxx>
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Date: Wednesday, 25 March 2026 18:29 CET
Subject: Re: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"
Hey,
seemed like an interesting suggestion :)
I did now also add (test timeouts, obviously):

job_max_vacate_time = 120
kill_sig = SIGTERM
kill_sig_timeout = 120
want_graceful_removal = true

But this also doesn't seem to fix it.. :/
My understanding is that MachineMaxVacateTime (in conjunction with JobMaxVacateTime) should define the distance between the SIGTERM and the SIGKILL.
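If that understanding is right, the effective window between the two signals would be computed roughly like this (my reading of the docs, not verified against the source; plain-Python sketch):

```python
def effective_vacate_time(machine_max_vacate, job_max_vacate=None):
    """My understanding: the job may request its own JobMaxVacateTime, but the
    machine's MachineMaxVacateTime caps it; with no job-side value, the
    machine value applies as-is."""
    if job_max_vacate is None:
        return machine_max_vacate
    return min(job_max_vacate, machine_max_vacate)
```

So with MachineMaxVacateTime = 600 and job_max_vacate_time = 120, I'd expect a 120 s window.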
I'm open to trying any other suggestions!
Best,
- Joachim
On 25.03.26 at 17:47, Beyer, Christoph wrote:

Hi,

I think you need to put a signal wish into the job classad:

kill_sig = < ... > (e.g. SIGTSTP)

A job with this setup will receive the signal you configured in order to preempt, followed by a SIGKILL - I fear the time between the two signals is short, but you will need to test that. Also, I have no idea if you can alter the time between the two signals - the knob may be missing to do so ...

Best
christoph
From: "Joachim Meyer" <jmeyer@xxxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 25 March 2026 16:35:52
Subject: Re: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"

Hi,
thanks for the response!
I did add the "(time() - Target.EnteredCurrentStatus) >= 3000" as a delay of 50 min, so already-running jobs can run for 50 minutes after a higher-priority job comes around.
However, the running job is not informed about its situation, i.e. it has not received a SIGTERM or anything yet, and thus doesn't know it's about to be killed.
I'd love to give those to-be-killed jobs the 10 min VacateTime to potentially write out a checkpoint or so - at least that's my understanding of what Max(Machine)VacateTime is actually intended for:
first send a SIGTERM, wait for the VacateTime, and then send a SIGKILL.

But that only happens in the Preempting/Vacating activity.. In our case, the starter jumps directly to Preempting/Killing, and I don't quite understand why it's skipping the Vacating activity.
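To make the sequence I expect explicit, here is a plain-Python sketch of that protocol (not HTCondor code; the vacate_time value comes from config):

```python
import signal
import subprocess

def graceful_stop(proc, vacate_time):
    """Send SIGTERM, give the process up to vacate_time seconds to exit
    (e.g. to write a checkpoint), then escalate to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=vacate_time)  # job exits cleanly within the vacate window
    except subprocess.TimeoutExpired:
        proc.kill()  # vacate window elapsed: hard kill
        proc.wait()
```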
Not sure if I'm missing something to make the VacateTime take effect?

Cheers,
- Joachim

On 25.03.26 at 15:04, Beyer, Christoph wrote:

Hi,

I think you should rather fold the delay into the preemption requirement, because as soon as the requirement is fulfilled it will preempt - we do not use preemption actively, hence this is just a wild guess ...

Best
christoph
From: "Joachim Meyer" <jmeyer@xxxxxxxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 25 March 2026 14:55:56
Subject: [HTCondor-users] PSLOT preemption - state change Claimed/Busy -> Preempting/Killing - Skipping "Preempting/Vacating"

Hi all,
we are trying to implement preemption for a small subset of our nodes.
All our nodes have a single partitionable slot - most with GPUs (we actually only care about GPU nodes for preemption).
We are on 24.12 across all EPs, APs, and the negotiator. The setup generally seems to work, with:
For the negotiator:
ALLOW_PSLOT_PREEMPTION = True
PREEMPTION_REQUIREMENTS = Target.AcctGroup =?= "prio" && !regexp("prio\..+@", My.AccountingGroup) && (time() - Target.EnteredCurrentStatus) >= 3000 && My.Machine =?= "the_special_machine"

On the EP:
RANK = 0
MachineMaxVacateTime = 60*10
MAXJOBRETIREMENTTIME = 0
START = ... && (Target.MaxJobRetirementTime == 0 || AcctGroup =?= "prio")

So we basically get a guaranteed maximum access delay of one hour for the group "prio" to their machine.
That's at least the idea. The 3000 s maximum in the queue is working as I intended it to; however, the dynamic slots get killed immediately once the preemption decision is made...
No vacate time at all - even though it is set to 60*10.
When condor_hold-ing jobs, they do get a SIGTERM first and then a SIGKILL after a timeout - I would expect this to be similar here, from what I understand..?
We do actively set MAXJOBRETIREMENTTIME to 0 because, to my understanding, otherwise we would not get a guaranteed start after one hour (as a new job might always snuggle in with the pslot preemption).
But I at least want the jobs to have their vacate time before going from SIGTERM to SIGKILL...

The StartLog from the EP shows:
03/25/26 12:04:10 slot1: Schedd addr = <10.143.248.61:9618?addrs=10.143.248.61-9618&alias=conduit2&noUDP&sock=schedd_2302500_1092>
03/25/26 12:04:10 slot1: Alive interval = 300
03/25/26 12:04:10 slot1: Schedd sending 5 preempting claims.
03/25/26 12:04:10 slot1_2: Canceled ClaimLease timer (1859)
03/25/26 12:04:10 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 queuing timer
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_2: unbind DevIds for slot1.2 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: Canceled ClaimLease timer (1876)
03/25/26 12:04:10 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1_3, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_3: ubind DevIds for slot1.3 unbind GPU-937c3d9c 1 OK
03/25/26 12:04:10 slot1_3: unbind DevIds for slot1.3 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: Canceled ClaimLease timer (1878)
03/25/26 12:04:10 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/25/26 12:04:10 ResMgr update_needed(0x2) -> 0x2 timer already queued
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 before : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1_4, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }
03/25/26 12:04:10 slot1_4: ubind DevIds for slot1.4 unbind GPU-de2e358d 1 OK
03/25/26 12:04:10 slot1_4: unbind DevIds for slot1.4 after : GPUs:{GPU-694f8794=1_7, GPU-937c3d9c=1, GPU-de2e358d=1, GPU-4353d652=1_5, GPU-a9035eaf=1_6, GPU-036117e2=1_7, GPU-7e6bcf8a=1_7, GPU-db1fcb33=1_7, }

Can anyone give advice on whether it is possible with the pslot preemption to implement a policy that gives guaranteed maximum access delays for the "prio" group / the owner of a machine?
And more specifically, is there anything I am missing to make jobs get their vacate time granted when being preempted?

Thanks!
- Joachim Meyer

--
Joachim Meyer
HPC Coordination & Support
Universität des Saarlandes FR Informatik | HPC
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Raum 4.03, 66123 Saarbrücken
T: +49 681 302-57522 jmeyer@xxxxxxxxxxxxxxxxxx www.uni-saarland.de
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/