
Re: [HTCondor-users] Sent vacate command to host for killing the job.



Hello Christoph,

Thanks for your reply.

The job goes back into the idle state and is re-considered for matchmaking. A snippet of the shadow logs is below; in this case the job was vacated a few minutes after entering the running state.

grep 30395.75 /var/log/condor/ShadowLog
06/25/24 02:57:59 Initializing a VANILLA shadow for job 30395.75
06/25/24 02:57:59 (30395.75) (933890): LIMIT_DIRECTORY_ACCESS = <unset>
06/25/24 02:58:00 (30395.75) (933890): Request to run on slot1_54@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.202:9618?addrs=xx.xx.87.202-9618&alias=condornode2982.test.com&noUDP&sock=startd_58319_8fcc> was ACCEPTED
06/25/24 03:02:27 (30395.75) (933890): Job 30395.75 is being evicted from slot1_54@xxxxxxxxxxxxxxxxxxxxxxx
06/25/24 03:02:27 (30395.75) (933890): **** condor_shadow (condor_SHADOW) pid 933890 EXITING WITH STATUS 107
06/25/24 03:02:30 Initializing a VANILLA shadow for job 30395.75
06/25/24 03:02:30 (30395.75) (937030): LIMIT_DIRECTORY_ACCESS = <unset>
06/25/24 03:02:30 (30395.75) (937030): Request to run on slot1_5@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.86.177:9618?addrs=xx.xx.86.177-9618&alias=condornode2701.test.com&noUDP&sock=startd_58570_ba11> was ACCEPTED
06/25/24 06:08:34 (30395.75) (937030): Job 30395.75 terminated: exited with status 0
06/25/24 06:08:34 (30395.75) (937030): **** condor_shadow (condor_SHADOW) pid 937030 EXITING WITH STATUS 115

From the SchedLog:

06/25/24 02:57:59 (pid:20802) Starting add_shadow_birthdate(30395.75)
06/25/24 02:57:59 (pid:20802) Started shadow for job 30395.75 on slot1@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.202:9618?addrs=xx.xx.87.202-9618&alias=condornode2982.test.com&noUDP&sock=startd_58319_8fcc> for user1, (shadow pid = 933890)
06/25/24 03:02:27 (pid:20802) Sent vacate command to <xx.xx.87.202:9618?addrs=xx.xx.87.202-9618&alias=condornode2982.test.com&noUDP&sock=startd_58319_8fcc> for job 30395.75
06/25/24 03:02:27 (pid:20802) Shadow pid 933890 for job 30395.75 exited with status 107
06/25/24 03:02:27 (pid:20802) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.202:9618?addrs=xx.xx.87.202-9618&alias=condornode2982.test.com&noUDP&sock=startd_58319_8fcc> for user1, 30395.75) deleted
06/25/24 03:02:30 (pid:20802) Starting add_shadow_birthdate(30395.75)
06/25/24 03:02:30 (pid:20802) Started shadow for job 30395.75 on slot1@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.86.177:9618?addrs=xx.xx.86.177-9618&alias=condornode2701.test.com&noUDP&sock=startd_58570_ba11> for user1, (shadow pid = 937030)
06/25/24 06:08:34 (pid:20802) Shadow pid 937030 for job 30395.75 exited with status 115
06/25/24 06:08:34 (pid:20802) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxx <xx.xx.86.177:9618?addrs=xx.xx.86.177-9618&alias=condornode2701.test.com&noUDP&sock=startd_58570_ba11> for user1, 30395.75) deleted

# condor_history 30395.75 -af jobruncount
2

We have monitoring of the schedd CPU and DaemonCoreDutyCycle; both were only in the single digits.
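
For reference, a rough sketch of how that can be spot-checked live (assuming the schedd ad publishes RecentDaemonCoreDutyCycle, as recent HTCondor versions do):

# condor_status -schedd -af Name RecentDaemonCoreDutyCycle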

No, we haven't set up defrag; these are the default settings.

# condor_config_val -dump | grep -i defrag
DEFRAG = $(LIBEXEC)/condor_defrag
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)
DEFRAG_DRAINING_MACHINES_PER_HOUR = 0
DEFRAG_DRAINING_SCHEDULE = graceful
DEFRAG_DRAINING_START_EXPR = FALSE
DEFRAG_INTERVAL = 600
DEFRAG_LOG = $(LOG)/DefragLog
DEFRAG_MAX_CONCURRENT_DRAINING = -1
DEFRAG_MAX_WHOLE_MACHINES = -1
DEFRAG_NAME =
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput
DEFRAG_REQUIREMENTS =
DEFRAG_STATE_FILE = $(LOCK)/defrag_state
DEFRAG_UPDATE_INTERVAL = 300
DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalSlotCpus
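
For completeness, one quick way to confirm the defrag daemon isn't actually enabled (sketch only) is to check that DEFRAG does not appear in the daemon list:

# condor_config_val DAEMON_LIST | grep -io defrag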


Thanks & Regards,
Vikrant Aggarwal


On Tue, Jun 25, 2024 at 2:25 AM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
arg,

hit the submit button too early - are you running any kind of defragmentation, maybe?

Maybe anything useful in the shadow logs of the affected jobs?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Christoph Beyer" <christoph.beyer@xxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 25. Juni 2024 08:08:17
Betreff: Re: [HTCondor-users] Sent vacate command to host for killingÂÂÂÂÂÂÂÂtheÂÂÂÂÂÂÂÂjob.

Hi Vikrant,

still struggling with this - hmm.

Does the job go back into the queue afterwards or does it end up in history as 'X'?

It might also be the job coming to a natural end and defining something weird in the 'on_exit_remove' section?
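
(Just as a sketch, the job's OnExitRemove expression can be checked with something like the following, where <cluster>.<proc> is a placeholder for the job id:

# condor_q <cluster>.<proc> -af OnExitRemove
# condor_history <cluster>.<proc> -af OnExitRemove
)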

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 25. Juni 2024 01:38:57
Betreff: Re: [HTCondor-users] Sent vacate command to host for killing theÂÂÂÂÂÂÂÂjob.

Thanks Jaime for getting back to me.

I haven't seen the message about max running jobs.
As a side note, I thought the schedd would never allow more jobs to run than the limit in the first place; also, this job was considered for preemption after 2-3 hours of runtime. If it were the max-jobs limit (which I believe it is not), shouldn't it decide to kill the jobs with less runtime?
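
Just as a sketch, the check I'd run to compare the configured limit against the actual number of running jobs at the time would be something like:

# condor_config_val MAX_JOBS_RUNNING
# condor_q -allusers -totals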

Regarding the schedd being overloaded: are you referring to high schedd CPU utilisation or something else?


On Mon, Jun 24, 2024, 5:06 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
The "Sent vacate commandâ message is written in the SchedLog when the schedd is vacating jobs because itâs shutting down or it believes it has too many jobs running (and consuming too many resources on the scheddâs machine).
If the latter is the case, then you will also see this message: "Preempting ### jobs due to MAX_JOBS_RUNNING changeâ.
You should also see this message: "Called preempt( %d ) ââ.
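
(A simple way to look for those, as a sketch assuming the default log location, would be:

# grep -E 'MAX_JOBS_RUNNING change|Called preempt' /var/log/condor/SchedLog*
)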

- Jaime

On Jun 21, 2024, at 10:14 AM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

Any idea what else could cause this issue?

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jun 18, 2024 at 10:18 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Thanks for your response,

We are not using preemption at all. NEGOTIATOR_CONSIDER_PREEMPTION is true here (on the worker nodes), but on the negotiator this setting is False.

# condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_RANK RANK PREEMPTION_REQUIREMENTS
true
(RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)
0
False
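
For completeness, the negotiator-side values can also be queried directly from the central manager; a sketch, assuming condor_config_val's -negotiator daemon-query option:

# condor_config_val -negotiator NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_REQUIREMENTS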


Thanks & Regards,
Vikrant Aggarwal


On Mon, Jun 17, 2024 at 5:00 PM Zach McGrew <mcgrewz@xxxxxxx> wrote:
Are you using Rank based preemption?

Is that EP's RANK set to something that could evaluate higher for one job compared to another, or is it a constant?

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, June 17, 2024 10:47 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Sent vacate command to host for killing the job.

Hello Experts,

We are seeing a few jobs in the batch randomly getting killed, with the following messages reported in the SchedLog. We are not sending any vacate command:

/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Sent vacate command to <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for job 2401567.2308
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, 2401567.2308) deleted
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxx <xx.xx.48.122:9618?addrs=xx.xx.48.122-9618&alias=aresnode0121test.com&noUDP&sock=startd_24441_1251> for user1, (shadow pid = 2178208)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) job_transforms for 2401567.2308: 1 considered, 1 applied (SetTeam)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, (shadow pid = 1785748)


The StartLog on the worker node shows the following; the slot (starter) logs don't show anything useful except signal 15.

06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/14/24 15:57:45 slot1_46: State change: starter exited
06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner
06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle
06/14/24 15:57:45 slot1_46: State change: IS_OWNER is false
06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed
06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete
06/14/24 15:57:45 slot1_46: Resource no longer needed, deleting
06/14/24 15:57:54 slot1_46: New machine resource of type -1 allocated

We are not taking any manual or automatic (periodic release) action on the jobs, and preemption is disabled on the negotiator in our cluster. Any thoughts on what could cause this issue?
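
In case it helps, a sketch of an extra check on the worker node (assuming the default per-slot starter log naming, e.g. StarterLog.slot1_46) to see what the starter logged around the eviction:

# grep -iE 'vacate|sigterm|shutdown' /var/log/condor/StarterLog.slot1_46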


Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/