
[HTCondor-users] Antw: Re: Claimed Slots vs running jobs



@ Michael
Yes, I ran condor_q -better-analyze <jobid>.
There are fewer slots available to run my jobs, because most of the slots have other jobs running.
That is why I compared the number of running jobs with the number of claimed slots.
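
The comparison described above can be scripted. The following is only a minimal sketch in Python: it assumes the slot list is available as whitespace-separated text lines of the form "Name State JobId" (for example from autoformat output of condor_status; the exact command is an assumption) and that the IDs of the running jobs are already known. The function name and the sample data are hypothetical.

```python
def claimed_slots_without_job(slot_lines, running_job_ids):
    """Return the names of slots that are Claimed but whose JobId
    is not in the set of running job IDs."""
    running = set(running_job_ids)
    suspicious = []
    for line in slot_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        # A slot without a job may report no JobId at all, or "undefined".
        job_id = fields[2] if len(fields) > 2 else "undefined"
        if state == "Claimed" and job_id not in running:
            suspicious.append(name)
    return suspicious


# Hypothetical sample data (slot names and job IDs are made up).
slots = [
    "slot1@node-a Claimed 100.0",
    "slot2@node-a Unclaimed undefined",
    "slot3@node-b Claimed 101.0",
]
running = ["100.0"]
print(claimed_slots_without_job(slots, running))
```

Every slot name it returns is a candidate for a closer look at that slot's StarterLog.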


@ Todd
I compared the running jobs with the claimed slots and found a claimed slot on which no job should be running.
The last lines of this slot's StarterLog are:
06/06/21 09:57:50 (pid:16748) Create_Process succeeded, pid=15548
06/06/21 10:08:02 (pid:16748) condor_read(): Socket closed abnormally when trying to read 5 bytes from <10.78.140.7:9618>, errno=10054
06/06/21 10:08:02 (pid:16748) Lost connection to shadow, no job lease specified

This slot stays in the claimed state.
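
To find other slots with the same symptom, the StarterLog files can be searched for the two messages shown above. This is a rough sketch in Python, assuming the log lines follow the "date time (pid:N) message" layout seen in this excerpt; the function name and the marker list are illustrative and not part of HTCondor.

```python
import re

# Matches lines such as:
# 06/06/21 10:08:02 (pid:16748) Lost connection to shadow, ...
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")

# Message fragments taken from the StarterLog excerpt above.
MARKERS = ("Lost connection to shadow", "Socket closed abnormally")


def find_disconnects(log_lines):
    """Return (timestamp, pid, message) for every line that contains a marker."""
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not in the expected layout
        timestamp, pid, message = match.groups()
        if any(marker in message for marker in MARKERS):
            hits.append((timestamp, int(pid), message))
    return hits


sample = [
    "06/06/21 09:57:50 (pid:16748) Create_Process succeeded, pid=15548",
    "06/06/21 10:08:02 (pid:16748) condor_read(): Socket closed abnormally "
    "when trying to read 5 bytes from <10.78.140.7:9618>, errno=10054",
    "06/06/21 10:08:02 (pid:16748) Lost connection to shadow, no job lease specified",
]
for timestamp, pid, message in find_disconnects(sample):
    print(timestamp, pid, message)
```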

The related lines of the ShadowLog are:
06/06/21 09:57:57 (731077.0) (3321901): Switching to new job 731121.0
06/06/21 09:57:57 (?.?) (3321901): Initializing a JAVA shadow for job 731121.0
06/06/21 09:57:57 (731121.0) (3321901): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.80:9618?CCBID=10.78.140.7:9618%3faddrs%3d10.78.140.7-9618%26noUDP%26sock%3dcollector#32189&PrivNet=eal.jku.at&addrs=192.168.0>
06/06/21 09:57:58 (731121.0) (3321901): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.80:9618?CCBID=10.78.140.7:9618%3faddrs%3d10.78.140.7-9618%26noUDP%26sock%3dcollector#32189&PrivNet=eal.jku.at&addrs=192.168.0>
06/06/21 09:57:58 (731121.0) (3321901): File transfer completed successfully.
...
06/06/21 10:10:31 (731121.0) (3321901): condor_read(): Socket closed abnormally when trying to read 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxxx, errno=104 Connection reset by peer
06/06/21 10:10:31 (731121.0) (3321901): ERROR "Can no longer talk to condor_starter <140.78.139.130:55609>" at line 230 in file /var/lib/condor/execute/slot1/dir_24052/userdir/.tmpOxzx17/condor-8.8.12/src/condor_shadow.V6.1/NT>

It seems the slot was claimed and the job was started. Then a connection error occurred, and the slot has stayed in the claimed state ever since.
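
For reference, the gap between the two log entries can be computed from their timestamps. A small sketch in Python, assuming both logs use the month/day/year layout and that the clocks of the two machines are reasonably synchronized (neither is confirmed here); the helper name is hypothetical.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpts: MM/DD/YY HH:MM:SS
FMT = "%m/%d/%y %H:%M:%S"


def seconds_between(first, second):
    """Return the number of seconds from timestamp `first` to timestamp `second`."""
    delta = datetime.strptime(second, FMT) - datetime.strptime(first, FMT)
    return delta.total_seconds()


starter_lost = "06/06/21 10:08:02"   # StarterLog: "Lost connection to shadow"
shadow_error = "06/06/21 10:10:31"   # ShadowLog: "Can no longer talk to condor_starter"
print(seconds_between(starter_lost, shadow_error))  # 149.0
```

So the shadow reported the broken connection about two and a half minutes after the starter did.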

Why was the connection terminated and not re-established?
How can I find the reason for the problem?

- Werner
>>> Todd L Miller <tlmiller@xxxxxxxxxxx> 03.06.2021 19:34 >>>
> That's probably the right place to start debugging.

    If jobs are idle but slots are claimed, that probably means that
the negotiator has given slots to jobs -- that is, they match -- but
there's a problem activating the claims.

    You may want to check the ShadowLog -- if no shadows are
being started, that's one (set of potential) problem(s).  If shadows are
being started but can't start a job, there should be some information in
the ShadowLog about why.

- ToddM