|
Hi all,

At CERN (running v24.0.7 on the schedds and 24.0.3 on the startds), we have observed that jobs scheduled to run on a worker node sometimes do not actually start, apparently because of a communication problem between the startd and the schedd, and after some time they are rescheduled somewhere else. Looking at our logs, this seems to happen with a certain (low) frequency on our schedds. In many cases, the StartLog shows lines like "ClaimId [...] not found" while, at the same time, the schedd logs "Failed to get address of starter for this job" in /var/log/messages.

Today I encountered a scenario where this was happening repeatedly on one of our schedds. For a couple of hours, whenever I submitted a job to it, the job would soon be marked as Running even though it hadn't actually started
(condor_q -run showed no assigned worker node and
the RemoteHost ClassAd attribute was undefined). When I could find the worker node where the job was supposed to be running, I would find the above complaints in its logs. The job was later moved to a different worker
node, where the same thing happened. Restarting and even rebooting the schedd didn't help. Eventually the schedd seemed to fix itself, and the jobs finally ran on the last worker node they had been scheduled to.

During this episode, I also found many messages like "failed to connect schedd_xxxx_yyyy as requested by STARTD" in the SharedPortLog of the schedd. E.g.:

02/20/26 11:43:18 SharedPortServer: server was busy, failed to connect schedd_3401_61b0 as requested by STARTD <188.185.208.202:9618?addrs=188.185.208.202-9618+[2001-1458-303-10--100-47]-9618&alias=b9p10p7343.cern.ch&noUDP&sock=startd_2536231_16ff> on <188.185.208.202:44263>: primary (<cookie>/schedd_3401_61b0): Connection refused (111); alt (/var/lock/condor/daemon_sock/schedd_3401_61b0): Connection refused (111)

But I'm not sure whether this occurs in every case where "Failed to get address of starter for this job" is shown, and I think it also occurs in other scenarios where this problem is not happening.

It seems that, for some reason, the schedds sometimes have problems communicating with the startds (probably not always affecting every job as consistently as in the case I saw today). Has anybody else seen this? Any idea whether this could be caused
by some misconfiguration in the schedd (not enough threads/sockets or the like)? What's more puzzling to me is why it was suddenly fixed.

Cheers,
Antonio |
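
One way to check whether the SharedPortLog refusals line up in time with the affected jobs is to count them per minute. The following is only a minimal Python sketch: it assumes the log lines have the same layout as the excerpt quoted above, and the names `BUSY_RE` and `count_busy_failures` are hypothetical helpers, not part of HTCondor.

```python
import re
import sys
from collections import Counter

# Assumption: lines look like the SharedPortLog excerpt quoted above, i.e.
# "MM/DD/YY HH:MM:SS ... server was busy, failed to connect <sock> as requested by <who> ..."
BUSY_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} .*?"
    r"server was busy, failed to connect (\S+) as requested by (\S+)"
)


def count_busy_failures(lines):
    """Count 'server was busy' refusals per (minute, target socket, requester)."""
    counts = Counter()
    for line in lines:
        match = BUSY_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Optional command-line use: pass the path to a SharedPortLog file as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], errors="replace") as log_file:
        for (minute, sock, requester), n in sorted(count_busy_failures(log_file).items()):
            print(minute, sock, requester, n)
```

Comparing the per-minute counts with the times at which jobs were marked Running without a RemoteHost would show whether the two symptoms actually coincide or whether the refusals also appear when jobs start normally.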