
Re: [HTCondor-users] Sent vacate command to host for killing the job.



The value of DETECTED_MEMORY shouldn't change while the schedd is running, and the schedd only calculates the value of the MAX_JOBS_RUNNING config knob at startup and on a condor_reconfig. So I don't think the issue is the value changing while the schedd runs, despite what the log message says.
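As a quick sanity check, the raw expression and the value the running schedd actually computed can be compared, and a reconfig forces a recompute. This is only a sketch; I'm assuming the evaluated limit shows up as the MaxJobsRunning attribute in the schedd ad, so adjust the attribute name if your version differs.

Raw config, macros expanded but not evaluated:

# condor_config_val max_jobs_running

Evaluated limit as published by the running schedd (attribute name assumed):

# condor_status -schedd -autoformat Name MaxJobsRunning

Re-read the config and recompute the limit without restarting the schedd:

# condor_reconfig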

I suspect the schedd is unintentionally starting one too many jobs, then detecting the mistake and preempting one of the jobs. Starting a new job consists of several steps between receiving a match from the negotiator and launching a new condor_shadow process to handle running the job. The schedd checks the MAX_JOBS_RUNNING limit when it receives the match from the negotiator and counts the number of condor_shadow processes already running plus the number of jobs it's already in the process of starting, to see if the new match will put it over the limit. I think that count ends up being slightly off in some situations.

 - Jaime

On Jun 26, 2024, at 4:18 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Jaime and Christoph,

You are both right; sorry for the mistake on my side. In the latest messages I do see jobs getting preempted because of the MAX_JOBS_RUNNING limit. This limit shouldn't change; it should remain 10K (since detected memory is static and always above 10K). Unfortunately, the logs from the previous occurrences have been rotated, so I wanted to confirm whether this message also appeared for the older jobs.

# condor_config_val -dump | grep -i max_jobs_running
MAX_JOBS_RUNNING = MIN({$(DETECTED_MEMORY), 10000})

# condor_config_val max_jobs_running
MIN({192099, 10000})

06/25/24 03:02:26 (pid:20802) Reached MAX_JOBS_RUNNING: no more can run
06/25/24 03:02:26 (pid:20802) Finished sending RRL for test1
06/25/24 03:02:27 (pid:20802) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
06/25/24 03:02:27 (pid:20802) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/25/24 03:02:27 (pid:20802) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/25/24 03:02:27 (pid:20802) Preempting 1 jobs due to MAX_JOBS_RUNNING change
06/25/24 03:02:27 (pid:20802) Called preempt( 1 )
06/25/24 03:02:27 (pid:20802) Sent vacate command to <xx.xx87.202:9618?addrs=xx.xx87.202-9618&alias=condornode2982.example.com&noUDP&sock=startd_58319_8fcc> for job 30395.75
06/25/24 03:02:27 (pid:20802) Activity on stashed negotiator socket: <xx.xx80.254:8379>
06/25/24 03:02:27 (pid:20802) Using negotiation protocol: NEGOTIATE
06/25/24 03:02:27 (pid:20802) Negotiating for owner: test1@xxxxxxxxxxx (flock level 1, pool master1.example.com)
06/25/24 03:02:27 (pid:20802) Reached MAX_JOBS_RUNNING: no more can run
06/25/24 03:02:27 (pid:20802) Finished sending RRL for test1
06/25/24 03:02:27 (pid:20802) Completed DEACTIVATE_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx <xx.xx87.202:9618?addrs=xx.xx87.202-9618&alias=condornode2982.example.com&noUDP&sock=startd_58319_8fcc> for test1
06/25/24 03:02:27 (pid:20802) Activity on stashed negotiator socket: <xx.xx80.254:8379>
06/25/24 03:02:27 (pid:20802) Using negotiation protocol: NEGOTIATE
06/25/24 03:02:27 (pid:20802) Negotiating for owner: test1@xxxxxxxxxxx (flock level 1, pool master1.example.com)
06/25/24 03:02:27 (pid:20802) Reached MAX_JOBS_RUNNING: no more can run
06/25/24 03:02:27 (pid:20802) Finished sending RRL for test1
06/25/24 03:02:27 (pid:20802) Shadow pid 731129 for job 30280.0 exited with status 115

I am going to change the expression to a static value like 10K or 20K to see whether this issue still happens.
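For example, something like this in a local config file on the schedd host (the file name here is just an example), followed by a reconfig so the schedd picks up the new value without a restart:

# cat /etc/condor/config.d/99-max-jobs-running.conf
MAX_JOBS_RUNNING = 10000

# condor_reconfig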


Thanks & Regards,
Vikrant Aggarwal


On Wed, Jun 26, 2024 at 2:33 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
You are correct that the schedd will not start new jobs running if it detects that doing so would cause it to exceed the MAX_JOBS_RUNNING limit. If a bug is causing it to go over the limit, we'd want to track that down.
The schedd will also decline to start new jobs running if it thinks its CPU utilization is too high (configurable via the CURB_MATCHMAKING config knob). But it won't kill running jobs for that reason.
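For reference, that knob can be inspected on the schedd host the same way as the other knobs in this thread; its value is an expression the schedd evaluates to decide whether to hold off on starting new jobs when it is too busy:

# condor_config_val CURB_MATCHMAKING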

If the schedd is running up against the MAX_JOBS_RUNNING limit, you should also see the following log message occasionally:

Reached MAX_JOBS_RUNNING: no more can run

Have you seen the "Called preempt( ## )" message? If so, was there any additional text on that line?
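If it is not in the current SchedLog, a search across the rotated copies as well may turn it up (using the log paths from the earlier excerpts in this thread):

# grep -E 'MAX_JOBS_RUNNING|Called preempt|Sent vacate' /var/log/condor/SchedLog*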

 - Jaime

On Jun 24, 2024, at 6:38 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Thanks, Jaime, for getting back to me.

I haven't seen the message about max running jobs.
As a side note, I thought the schedd would never allow more jobs to run than the limit in the first place. Also, this job was considered for preemption after 2-3 hours of runtime; if it were the max limit (which I believe it's not), shouldn't it decide to kill the jobs with less runtime?

Regarding the schedd being overloaded, are you referring to high schedd CPU utilisation or something else?


On Mon, Jun 24, 2024, 5:06 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
The "Sent vacate commandâ message is written in the SchedLog when the schedd is vacating jobs because itâs shutting down or it believes it has too many jobs running (and consuming too many resources on the scheddâs machine).
If the latter is the case, then you will also see this message: "Preempting ### jobs due to MAX_JOBS_RUNNING changeâ.
You should also see this message: "Called preempt( %d ) ââ.

 - Jaime

On Jun 21, 2024, at 10:14 AM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

Any idea what else could cause this issue? 

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jun 18, 2024 at 10:18 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Thanks for your response.

We are not using preemption at all. NEGOTIATOR_CONSIDER_PREEMPTION is true here (on the worker nodes), but on the negotiator this setting is False.

# condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_RANK RANK PREEMPTION_REQUIREMENTS
true
(RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRuntime), 0, TotalJobRuntime)
0
False
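(To double-check that the negotiator itself sees the same values as the worker nodes, the knobs can also be queried from the negotiator directly. I'm assuming condor_config_val here supports the -negotiator daemon option; if not, the plain command can be run on the negotiator host itself.)

# condor_config_val -negotiator NEGOTIATOR_CONSIDER_PREEMPTION PREEMPTION_REQUIREMENTS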


Thanks & Regards,
Vikrant Aggarwal


On Mon, Jun 17, 2024 at 5:00 PM Zach McGrew <mcgrewz@xxxxxxx> wrote:
Are you using Rank based preemption?

Is that EP's RANK set to something that could evaluate higher for one job compared to another, or is it a constant?

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, June 17, 2024 10:47 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Sent vacate command to host for killing the job.

Hello Experts,

We are seeing a few jobs in the batch randomly getting killed; the following messages are reported in the SchedLog. We are not sending any vacate command:

/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Sent vacate command to <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for job 2401567.2308
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Shadow pid 1785748 for job 2401567.2308 exited with status 107
/var/log/condor/SchedLog:06/14/24 15:57:45 (pid:10490) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, 2401567.2308) deleted
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog:06/14/24 16:02:49 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxx <xx.xx.48.122:9618?addrs=xx.xx.48.122-9618&alias=aresnode0121test.com&noUDP&sock=startd_24441_1251> for user1, (shadow pid = 2178208)
/var/log/condor/SchedLog.old:06/14/24 12:13:55 (pid:10490) job_transforms for 2401567.2308: 1 considered, 1 applied (SetTeam)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Starting add_shadow_birthdate(2401567.2308)
/var/log/condor/SchedLog.old:06/14/24 12:14:44 (pid:10490) Started shadow for job 2401567.2308 on slot1@xxxxxxxxxxxxxxxxxxxxxx <xx.xx.87.172:9618?addrs=xx.xx.87.172-9618&alias=condornode2952test.com&noUDP&sock=startd_58327_0c8e> for user1, (shadow pid = 1785748)


The StartLog on the worker node shows the following. The slot logs don't show anything useful except signal 15.

06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Called deactivate_claim()
06/14/24 15:57:45 slot1_46: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/14/24 15:57:45 slot1_46: State change: starter exited
06/14/24 15:57:45 slot1_46: State change: No preempting claim, returning to owner
06/14/24 15:57:45 slot1_46: Changing state and activity: Preempting/Vacating -> Owner/Idle
06/14/24 15:57:45 slot1_46: State change: IS_OWNER is false
06/14/24 15:57:45 slot1_46: Changing state: Owner -> Unclaimed
06/14/24 15:57:45 slot1_46: Changing state: Unclaimed -> Delete
06/14/24 15:57:45 slot1_46: Resource no longer needed, deleting
06/14/24 15:57:54 slot1_46: New machine resource of type -1 allocated

We are not taking any manual or automatic (periodic release) action on the jobs, and preemption is disabled in our cluster on the negotiator node. Any thoughts on what could cause this issue?


Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/