
Re: [HTCondor-users] Shadow exit with status 108 (Condor Version 9.0.17)



You'll want to look in the StartLog (search for "slot1_92") for why the job was refused.

 - Jaime
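The StartLog search suggested above can be narrowed to the moment of the refusal. This is a minimal sketch, not an HTCondor tool: `lines_near` is a hypothetical helper, and it assumes each StartLog line begins with the same `MM/DD/YY HH:MM:SS` timestamp seen in the log excerpts in this thread.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"   # e.g. "12/12/23 11:15:30"
STAMP_LENGTH = 17                    # length of that timestamp prefix


def lines_near(lines, needle, when, window_seconds=120):
    """Yield lines that contain `needle` and are stamped within
    `window_seconds` of `when` (a string in STAMP_FORMAT)."""
    center = datetime.strptime(when, STAMP_FORMAT)
    window = timedelta(seconds=window_seconds)
    for line in lines:
        if needle not in line:
            continue
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            # Assumption: lines without a leading timestamp are
            # continuation lines and can be skipped.
            continue
        if abs(stamp - center) <= window:
            yield line.rstrip("\n")
```

To use it, open the StartLog on the execute node (its location depends on the local configuration) and pass the file object as `lines`, with `"slot1_92"` as `needle` and the time the shadow reported the refusal as `when`.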

On Dec 12, 2023, at 2:03 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Thanks, Jaime, for your reply.

I forgot about the shadow logs. Yes, they do indicate which slot the job tried to run on.

12/12/23 11:15:30 (12104310.105) (3126758): RemoteResource::killStarter(): Could not send command to startd
12/12/23 11:15:30 (12104310.105) (3126758): Job 12104310.105 terminated: exited with status 1
12/12/23 11:15:30 (12104310.105) (3126758): Reporting job exit reason 100 and attempting to fetch new job.
12/12/23 11:15:30 (12104310.105) (3126758): Switching to new job 12104372.1
12/12/23 11:15:30 (?.?) (3126758): Initializing a VANILLA shadow for job 12104372.1
12/12/23 11:15:30 (12104372.1) (3126758): LIMIT_DIRECTORY_ACCESS = <unset>
12/12/23 11:15:30 (12104372.1) (3126758): Request to run on slot1_92@xxxxxxxxxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&alias=test.example.com&noUDP&sock=startd_692440_d2e3> was REFUSED
12/12/23 11:15:30 (12104372.1) (3126758): Job 12104372.1 is being evicted from slot1_92@xxxxxxxxxxxxxxxx
12/12/23 11:15:30 (12104372.1) (3126758): RemoteResource::killStarter(): Could not send command to startd
12/12/23 11:15:30 (12104372.1) (3126758): logEvictEvent with unknown reason (108), not logging.
12/12/23 11:15:30 (12104372.1) (3126758): **** condor_shadow (condor_SHADOW) pid 3126758 EXITING WITH STATUS 108
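To see how often this happens and on which slots, the REFUSED lines can be pulled out of the ShadowLog. The sketch below is only an illustration: the regular expression is modeled on the single `was REFUSED` line quoted above, so it is an assumption that other HTCondor versions format the line the same way, and `refused_requests` and `refusals_per_slot` are hypothetical helper names.

```python
import re
from collections import Counter

# Modeled on the ShadowLog line quoted in this thread:
# "<date> <time> (<job id>) (<shadow pid>): Request to run on <slot> <addr> was REFUSED"
REFUSED = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): "
    r"Request to run on (?P<slot>\S+) .*was REFUSED"
)


def refused_requests(lines):
    """Yield (timestamp, job id, slot name) for every refused request."""
    for line in lines:
        match = REFUSED.match(line)
        if match:
            yield match.group("stamp"), match.group("job"), match.group("slot")


def refusals_per_slot(lines):
    """Count refused requests per slot name."""
    return Counter(slot for _, _, slot in refused_requests(lines))
```

Passing an open ShadowLog file to `refusals_per_slot` gives a quick view of whether the refusals cluster on a few execute nodes or are spread across the pool.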

I checked the slot logs on the worker node:

The existing job on this slot (12104310.105, also shown in the messages above) started at 12/12/23 10:55:36 and completed at 12/12/23 11:15:28, according to the slot logs.

12/12/23 11:15:28 (pid:1702347) **** condor_starter (condor_STARTER) pid 1702347 EXITING WITH STATUS 0

The next job, 24380413.26, from a different submit machine, started on this slot at 12/12/23 11:37:19.

I don't understand why job 12104372.1 was not allowed to run on slot1_92.

There is no mention of the job ID in the slot logs.

We are using dynamic partitionable slots. 



Thanks & Regards,
Vikrant Aggarwal


On Tue, Dec 12, 2023 at 2:41 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
You'll have to look at the ShadowLog on the submit node and the StartLog on the execute node to get details on why the attempts to claim failed. This should not be happening frequently in normal circumstances.

One explanation, which you hint at, is that the slot is being matched to another job before the schedd can finish fully claiming it. This can happen if the negotiator runs a new matchmaking cycle quickly (while the schedd is still trying to start additional jobs on a partitionable slot).

 - Jaime

On Dec 12, 2023, at 12:20 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

108 JOB_NOT_STARTED Can't connect to startd or request refused

With scheduler-level splitting, we are noticing that JobRunCount keeps increasing because of the 108 error. This happens often on various submit nodes. It doesn't have an impact on job runtime, but is it expected to happen this frequently?

It looks like some other job used the claim. Is that why this job was forced to delete the claim? Since the job hadn't started running yet, nothing conclusive can be determined from the worker node logs.


12/12/23 11:06:25 (pid:551974) job_transforms for 12104372.1: 2 considered, 2 applied (SetTeam,SetWaitForSec)
12/12/23 11:15:30 (pid:551974) match (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.140.191:9618?addrs=xx.xx.140.191-9618&alias=test155.example.com&noUDP&sock=startd_692440_d2e3> for testuser) switching to job 12104372.1
12/12/23 11:15:30 (pid:551974) Shadow pid 3126758 switching to job 12104372.1.
12/12/23 11:15:30 (pid:551974) Starting add_shadow_birthdate(12104372.1)
12/12/23 11:15:30 (pid:551974) Match record (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.140.191:9618?addrs=xx.xx.140.191-9618&alias=test155.example.com&noUDP&sock=startd_692440_d2e3> for testuser, 12104372.1) deleted
12/12/23 11:15:30 (pid:551974) Shadow pid 3126758 for job 12104372.1 exited with status 108
12/12/23 11:23:11 (pid:551974) Starting add_shadow_birthdate(12104372.1)
12/12/23 11:23:11 (pid:551974) Started shadow for job 12104372.1 on slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.140.126-9618&alias=test126.example.com&noUDP&sock=startd_3028316_784f> for testuser, (shadow pid = 3159624)
12/12/23 11:54:37 (pid:551974) Shadow pid 3159624 for job 12104372.1 exited with status 115
12/12/23 11:54:37 (pid:551974) Match record (slot1@xxxxxxxxxxxxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.140.126-9618&alias=test126.example.com&noUDP&sock=startd_3028316_784f> for testuser, 12104372.1) deleted
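To judge how widespread the 108 exits are on a submit node, the shadow exit lines in the SchedLog can be tallied per job. This is a rough sketch under one assumption: that exit lines always follow the `Shadow pid N for job X exited with status S` wording shown in the excerpt above. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Modeled on the SchedLog line quoted in this thread:
# "Shadow pid 3126758 for job 12104372.1 exited with status 108"
SHADOW_EXIT = re.compile(
    r"Shadow pid (?P<pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)


def exit_statuses_by_job(lines):
    """Map each job id to the list of shadow exit statuses, in log order."""
    statuses = defaultdict(list)
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            statuses[match.group("job")].append(int(match.group("status")))
    return dict(statuses)


def jobs_with_status(lines, status=108):
    """Map each job id to how many times its shadow exited with `status`."""
    return {
        job: codes.count(status)
        for job, codes in exit_statuses_by_job(lines).items()
        if status in codes
    }
```

A job that shows several 108 exits before it finally runs is one whose JobRunCount was inflated by refused claims rather than by real execution attempts.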



Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
