Hit the submit button too early - are you running any kind of defragmentation, maybe?
Hi Vikrant,
Still struggling with this, hmm.
Does the job go back into the queue afterwards, or does it end up in the history as 'X'?
It might also be that the job comes to a natural end and something unusual is defined in its 'on_exit_remove' expression?
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 25 June 2024 01:38:57
Subject: Re: [HTCondor-users] Sent vacate command to host for killing the job.
Thanks Jaime for getting back to me.
I haven't seen the message about max running jobs.
As an aside, I thought the schedd would never allow more jobs to run than the limit in the first place. Also, this job was considered for preemption after 2-3 hours of runtime. If the max limit is the cause (which I believe it is not), shouldn't the schedd kill the jobs with less runtime instead?
Regarding the schedd being overloaded, do you mean high schedd CPU utilisation or something else?
The "Sent vacate command" message is written in the SchedLog when the schedd is vacating jobs because it's shutting down or it believes it has too many jobs running (and consuming too many resources on the schedd's machine).
If the latter is the case, then you will also see this message: "Preempting ### jobs due to MAX_JOBS_RUNNING change".
You should also see this message: "Called preempt( %d ) ...".
- Jaime
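For anyone following along, here is a minimal sketch of how one might scan a SchedLog for the three messages mentioned above. The function name `find_preempt_messages` and the regular expressions are illustrative assumptions built only from the message texts quoted in this thread; the exact wording may differ between HTCondor versions.

```python
import re

# Patterns derived from the message texts quoted in this thread (assumed wording).
PATTERNS = {
    "vacate": re.compile(r"Sent vacate command to .* for job (\d+\.\d+)"),
    "max_jobs": re.compile(r"Preempting (\d+) jobs due to MAX_JOBS_RUNNING change"),
    "preempt_call": re.compile(r"Called preempt\(\s*(\d+)\s*\)"),
}


def find_preempt_messages(lines):
    """Return (kind, captured_value, line) for every log line matching a pattern."""
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group(1), line.rstrip("\n")))
                break  # at most one pattern is reported per line
    return hits
```

The function accepts any iterable of lines, so an open file handle for /var/log/condor/SchedLog can be passed in directly. If only "vacate" hits show up, without the other two kinds, that would be consistent with the MAX_JOBS_RUNNING explanation not applying here.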
Hello Experts,
Any idea what else could cause this issue?
Thanks & Regards,
Vikrant Aggarwal
Thanks for your response.
We are not using preemption at all. NEGOTIATOR_CONSIDER_PREEMPTION is true here (on the worker nodes), but on the negotiator this setting is False.
# condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_RANK RANK PREEMPTION_REQUIREMENTS
true
(RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)
0
False
Thanks & Regards,
Vikrant Aggarwal
Are you using Rank based preemption?
Is that EP's RANK set to something that could evaluate higher for one job compared to another, or is it a constant?
-Zach
________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, June 17, 2024 10:47 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Sent vacate command to host for killing the job.
Hello Experts,
We are seeing a few jobs in the batch randomly getting killed, with the following messages reported in the SchedLog. We are not sending any vacate command:
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Sent vacate command to <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for job 2401567.2308
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, 2401567.2308) deleted
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxx <xx.xx.48.122:9618?addrs=xx.xx.48.122-9618&alias=aresnode0121test.com&noUDP&sock=startd_24441_1251> for user1, (shadow pid = 2178208)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) job_transforms for 2401567.2308: 1 considered, 1 applied (SetTeam)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, (shadow pid = 1785748)
The StartLog on the worker node shows the following. The slot logs don't show anything useful except signal 15.
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/14/24 15:57:45 slot1_46: State change: starter exited
06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner
06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle
06/14/24 15:57:45 slot1_46: State change: IS_OWNER is false
06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed
06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete
06/14/24 15:57:45 slot1_46: Resource no longer needed, deleting
06/14/24 15:57:54 slot1_46: New machine resource of type -1 allocated
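To make the slot's state sequence easier to read, here is a minimal sketch that pulls only the state transitions out of StartLog lines like the ones above. The function name `slot_transitions` and the regular expression are assumptions based solely on the line format shown in this excerpt, not on any documented log format.

```python
import re

# Matches both "Changing state: A -> B" and
# "Changing state and activity: A/x -> B/y" as they appear in the excerpt.
TRANSITION = re.compile(
    r"^(?P<when>\S+ \S+) (?P<slot>\S+): Changing state(?: and activity)?: "
    r"(?P<old>\S+) -> (?P<new>\S+)"
)


def slot_transitions(lines):
    """Yield (timestamp, slot, old_state, new_state) for each transition line."""
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            yield match.group("when", "slot", "old", "new")


# Sample lines copied from the StartLog excerpt in this message.
sample = [
    "06/14/24 15:57:45 slot1_46: Called deactivate_claim()",
    "06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating",
    "06/14/24 15:57:45 slot1_46: State change: starter exited",
    "06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle",
    "06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed",
    "06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete",
]

for when, slot, old, new in slot_transitions(sample):
    print(f"{when} {slot}: {old} -> {new}")
```

Running the same function over the full StartLog and filtering on one slot name would give a per-slot timeline that can be lined up against the SchedLog timestamps.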
We are not taking any manual or automatic (periodic release) action on the jobs. Preemption is disabled in our cluster on the negotiator node. Any thoughts on what could cause this issue?
Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/