[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Sent vacate command to host for killing the job.



arg,

hit the submit button to earlier - are you running any kind of defragmentation maybe ?

Maybe anything useful in the shadowlogs of the effected jobs ?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Christoph Beyer" <christoph.beyer@xxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 25. Juni 2024 08:08:17
Betreff: Re: [HTCondor-users] Sent vacate command to host for killing        the        job.

Hi Vikrant,

still struggling with this - hmm.

Does the job go back into the queue afterwards or does it end up in history as 'X' ?

It might also be the job coming to a natural end and defining something weird in the 'on_exit_remove' section ?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 25. Juni 2024 01:38:57
Betreff: Re: [HTCondor-users] Sent vacate command to host for killing the        job.

Thanks Jamie for getting back to me.

I haven't seen the message about max running jobs.
Aside note I thought sched will never allow more jobs running than the limit at first place, btw this job was considered for preempt after 2-3 hours of runtime. If it's max limit (which I believe it's not) then shouldn't it decide to kill the jobs with less runtime?

Regarding sched overloaded are you  indicating high sched cpu utilisation or something else?


On Mon, Jun 24, 2024, 5:06âPM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
The "Sent vacate commandâ message is written in the SchedLog when the schedd is vacating jobs because itâs shutting down or it believes it has too many jobs running (and consuming too many resources on the scheddâs machine).
If the latter is the case, then you will also see this message: "Preempting ### jobs due to MAX_JOBS_RUNNING changeâ.
You should also see this message: "Called preempt( %d ) ââ.

 - Jaime

On Jun 21, 2024, at 10:14âAM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

Any idea what else could cause this issue? 

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jun 18, 2024 at 10:18âAM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
thanks for your response, 

We are not using preemption at all. Even though NEGOTIATOR_CONSIDER_PREEMPTION is true here (on worker nodes) but on negotiator this setting is False. 

# condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_RANK RANK PREEMPTION_REQUIREMENTS
true
(RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)
0
False


Thanks & Regards,
Vikrant Aggarwal


On Mon, Jun 17, 2024 at 5:00âPM Zach McGrew <mcgrewz@xxxxxxx> wrote:
Are you using Rank based preemption?

Is that EP's RANK set to something that could evaluate higher for one job compared to another, or is it a constant?

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, June 17, 2024 10:47 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Sent vacate command to host for killing the job.

You don't often get email from ervikrant06@xxxxxxxxx. Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
Hello Experts,

We are seeing a few jobs in the batch randomly getting killed, following messages reported in the shadow log file. We are not sending any vacate command:

/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Sent vacate command to <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com<http://condornode2952test.com/>&noUDP&sock=startd_58327_0c8e> for job 2401567.2308
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxx<mailto:slot1@xxxxxxxxxxxxxxxxxxxxxx> <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com<http://condornode2952test.com/>&noUDP&sock=startd_58327_0c8e> for user1, 2401567.2308) deleted
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxx<mailto:slot1@xxxxxxxxxxxxxxxxxxxx> <xx.xx.48.122:9618?addrs=xx.xx.48.122-9618&alias=aresnode0121test.com<http://aresnode0121test.com/>&noUDP&sock=startd_24441_1251> for user1, (shadow pid = 2178208)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) job_transforms for 2401567.2308: 1 considered, 1 applied (SetTeam)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxxxx<mailto:slot1@xxxxxxxxxxxxxxxxxxxxxx> <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com<http://condornode2952test.com/>&noUDP&sock=startd_58327_0c8e> for user1, (shadow pid = 1785748)


StartLog on worker node shows.. slot logs don't show anything useful except signal 15.

06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/14/24 15:57:45 slot1_46: State change: starter exited
06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner
06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle
06/14/24 15:57:45 slot1_46: State change: IS_OWNER is false
06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed
06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete
06/14/24 15:57:45 slot1_46: Resource no longer needed, deleting
06/14/24 15:57:54 slot1_46: New machine resource of type -1 allocated

We are not taking any manual or automatic (periodic release) action on the jobs. Preemption is disabled in our cluster on the Negotiator node, Any thoughts what could cause this issue?


Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/