
Re: [HTCondor-users] Sent vacate command to host for killing the job.



Thanks for your response,

We are not using preemption at all. NEGOTIATOR_CONSIDER_PREEMPTION is true here (on the worker nodes), but on the negotiator this setting is False.

# condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_RANK RANK PREEMPTION_REQUIREMENTS
true
(RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)
0
False


Thanks & Regards,
Vikrant Aggarwal


On Mon, Jun 17, 2024 at 5:00 PM Zach McGrew <mcgrewz@xxxxxxx> wrote:
Are you using Rank based preemption?

Is that EP's RANK set to something that could evaluate higher for one job compared to another, or is it a constant?

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, June 17, 2024 10:47 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Sent vacate command to host for killing the job.

Hello Experts,

We are seeing a few jobs in a batch getting killed at random, with the following messages reported in the SchedLog. We are not sending any vacate command ourselves:

/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Sent vacate command to <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for job 2401567.2308
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, 2401567.2308) deleted
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxx <xx.xx.48.122:9618?addrs=xx.xx.48.122-9618&alias=aresnode0121test.com&noUDP&sock=startd_24441_1251> for user1, (shadow pid = 2178208)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) job_transforms for 2401567.2308: 1 considered, 1 applied (SetTeam)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, (shadow pid = 1785748)
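
Because the grep output above mixes SchedLog and SchedLog.old, the events are not in time order. A minimal Python sketch for sorting one job's lines by timestamp and measuring the gaps is shown here. The helper name `job_events` is hypothetical, and the sketch only assumes the `MM/DD/YY HH:MM:SS (pid:N)` prefix visible in the lines above; the sample uses three of those lines.

```python
import re
from datetime import datetime

# Timestamp and pid prefix as it appears in the SchedLog lines quoted above.
LINE_RE = re.compile(r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) (.*)")

def job_events(lines, job_id):
    """Return (timestamp, message) pairs mentioning job_id, oldest first."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m or job_id not in m.group(2):
            continue
        when = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        events.append((when, m.group(2)))
    return sorted(events)

# Three lines copied from the grep output above (grep's file-name prefix included).
SAMPLE = [
    "/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107",
    "/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)",
    "/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)",
]

events = job_events(SAMPLE, "2401567.2308")
ran_for = events[1][0] - events[0][0]    # first shadow start -> shadow exit
idle_for = events[2][0] - events[1][0]   # shadow exit -> second shadow start
print(ran_for, idle_for)                 # prints: 3:43:01 0:05:04
```

In order, the job started at 12:14:44, its shadow exited at 15:57:45 after about 3 hours 43 minutes, and it was restarted on a different node about 5 minutes later.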


The StartLog on the worker node shows the following. The slot logs don't show anything useful except signal 15.

06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/14/24 15:57:45 slot1_46: State change: starter exited
06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner
06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle
06/14/24 15:57:45 slot1_46: State change: IS_OWNER is false
06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed
06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete
06/14/24 15:57:45 slot1_46: Resource no longer needed, deleting
06/14/24 15:57:54 slot1_46: New machine resource of type -1 allocated
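
To compare slots or nodes, the state transitions in a StartLog excerpt like the one above can be reduced to a short path. This is a rough Python sketch, not an HTCondor tool: the function name `state_path` is hypothetical, and the regular expression only assumes the two "Changing state" message shapes seen in the lines above.

```python
import re

# Matches both "Changing state: A -> B" and
# "Changing state and activity: A/x -> B/y" as seen in the StartLog excerpt.
TRANSITION_RE = re.compile(
    r"(\S+): Changing state(?: and activity)?: (\S+) -> (\S+)"
)

def state_path(lines, slot):
    """Return the sequence of slot states (activity dropped) for one slot."""
    path = []
    for line in lines:
        m = TRANSITION_RE.search(line)
        if not m or m.group(1) != slot:
            continue
        old = m.group(2).split("/")[0]
        new = m.group(3).split("/")[0]
        if not path:
            path.append(old)
        if path[-1] != new:
            path.append(new)
    return path

# Lines copied from the StartLog excerpt above.
STARTLOG = [
    "06/14/24 15:57:45 slot1_46: Called deactivate_claim()",
    "06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating",
    "06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner",
    "06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle",
    "06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed",
    "06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete",
]

path = state_path(STARTLOG, "slot1_46")
print(" -> ".join(path))
```

For this excerpt the path is Claimed, Preempting, Owner, Unclaimed, Delete. The slot passes through Preempting even though the log also reports "No preempting claim".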

We are not taking any manual or automatic (periodic release) action on the jobs, and preemption is disabled in our cluster on the negotiator node. Any thoughts on what could cause this issue?


Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/