@ Michael
Yes, I ran condor_q -better-analyze <jobid>. There are not enough free slots to run my jobs; most of the slots have other jobs running. That was the reason I compared the number of running jobs with the number of claimed slots.

@ Todd
I compared the running jobs with the claimed slots and found a claimed slot on which no job should be running. The last lines of the StarterLog of this slot are:

  06/06/21 09:57:50 (pid:16748) Create_Process succeeded, pid=15548
  06/06/21 10:08:02 (pid:16748) condor_read(): Socket closed abnormally when trying to read 5 bytes from <10.78.140.7:9618>, errno=10054
  06/06/21 10:08:02 (pid:16748) Lost connection to shadow, no job lease specified

This slot stays in the Claimed state. The related lines of the ShadowLog are:

  06/06/21 09:57:57 (731077.0) (3321901): Switching to new job 731121.0
  06/06/21 09:57:57 (?.?) (3321901): Initializing a JAVA shadow for job 731121.0
  06/06/21 09:57:57 (731121.0) (3321901): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.80:9618?CCBID=10.78.140.7:9618%3faddrs%3d10.78.140.7-9618%26noUDP%26sock%3dcollector#32189&PrivNet=eal.jku.at&addrs=192.168.0>
  06/06/21 09:57:58 (731121.0) (3321901): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxxxx <192.168.0.80:9618?CCBID=10.78.140.7:9618%3faddrs%3d10.78.140.7-9618%26noUDP%26sock%3dcollector#32189&PrivNet=eal.jku.at&addrs=192.168.0>
  06/06/21 09:57:58 (731121.0) (3321901): File transfer completed successfully.
  ...
  06/06/21 10:10:31 (731121.0) (3321901): condor_read(): Socket closed abnormally when trying to read 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxxx, errno=104 Connection reset by peer
  06/06/21 10:10:31 (731121.0) (3321901): ERROR "Can no longer talk to condor_starter <140.78.139.130:55609>" at line 230 in file /var/lib/condor/execute/slot1/dir_24052/userdir/.tmpOxzx17/condor-8.8.12/src/condor_shadow.V6.1/NT>

It seems the slot was claimed and the job was started. Then a connection error occurred, and the slot has stayed in the Claimed state ever since.
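The comparison of running jobs against claimed slots can be sketched roughly as follows. This is only an illustration under assumptions, not the exact procedure used here: it assumes two plain-text listings, one with "slot name, state" per line and one with "remote host, job status" per line. Such listings might come from something like condor_status -af Name State and condor_q -allusers -af RemoteHost JobStatus, but please check those options against your HTCondor version. The function name orphaned_claims and the sample host names are hypothetical.

```python
def orphaned_claims(slot_lines, job_lines):
    """Return claimed slots that no running job reports as its host."""
    claimed = set()
    for line in slot_lines:
        parts = line.split()
        # Expected shape (assumption): "<slot name> <State>"
        if len(parts) >= 2 and parts[1] == "Claimed":
            claimed.add(parts[0])

    busy = set()
    for line in job_lines:
        parts = line.split()
        # Expected shape (assumption): "<RemoteHost> <JobStatus>"
        # JobStatus 2 means Running in HTCondor.
        if len(parts) >= 2 and parts[1] == "2":
            busy.add(parts[0])

    # Claimed slots with no running job attached are the suspicious ones.
    return sorted(claimed - busy)


# Hypothetical sample data, not taken from the pool discussed here.
slots = [
    "slot1@node-a.example Claimed",
    "slot2@node-a.example Unclaimed",
    "slot3@node-b.example Claimed",
]
jobs = [
    "slot1@node-a.example 2",  # running job
    "undefined 1",             # idle job, no remote host yet
]

print(orphaned_claims(slots, jobs))
```

With the sample data, only slot3@node-b.example is reported, because it is claimed but no running job names it as its host. The StarterLog of such a slot is then the place to look.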
Why was the connection terminated and not re-established? How can I find the reason for the problem?

- Werner

>>> Todd L Miller <tlmiller@xxxxxxxxxxx> 03.06.2021 19:34 >>>
> That's probably the right place to start debugging. If jobs are idle but slots are claimed, that probably means that the negotiator has given slots to jobs -- that is, they match -- but there's a problem activating the claims. You may want to check the ShadowLog -- if no shadows are being started, that's one (set of potential) problem(s). If shadows are being started but can't start a job, there should be some information in the ShadowLog about why.

- ToddM

_______________________________________________
HTCondor-users mailing list
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/