
Re: [HTCondor-users] Lost slot claims



Hi Jason,

 

Thanks, but the RecentDaemonCoreDutyCycle is monitored, and the values were always below 50%. We might speculate that the monitoring missed some spikes, but the schedd was basically non-functional (not starting jobs, as I described) for a couple of hours, and the duty cycle was clearly not high during all that time (the schedd remained fully responsive to submit/query commands).

 

Cheers,

   Antonio

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jason Patton via HTCondor-users
Sent: Friday, February 20, 2026 9:21 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Lost slot claims

 

Hi Antonio,

 

The shared port daemon struggling to contact the schedd does seem worrying. Do you happen to know what the RecentDaemonCoreDutyCycle ("DCDC") was for your schedd around the time this was happening (condor_status -schedd -af Name RecentDaemonCoreDutyCycle), or maybe the approximate number of shadows that were running? A DCDC around 0.95 or higher could indicate that the schedd is overloaded.
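
If you want to keep an eye on it going forward, a minimal watch loop (just a sketch; it assumes it runs on the schedd host itself, so that pgrep can count the local condor_shadow processes) would be something like:

    # Log the schedd's recent duty cycle and the local shadow count
    # once a minute (Ctrl-C to stop):
    while true; do
        dcdc=$(condor_status -schedd -af Name RecentDaemonCoreDutyCycle)
        echo "$(date '+%F %T') $dcdc shadows=$(pgrep -c condor_shadow)"
        sleep 60
    done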

 

Jason

 

On Fri, Feb 20, 2026 at 8:31 AM Antonio Delgado Peris via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi all,

 

At CERN (running v24.0.7 on the schedds and 24.0.3 on the startds), we have observed that jobs scheduled to run on a worker node sometimes do not really start, apparently due to a communication problem between the startd and the schedd, and after some time they are rescheduled somewhere else. Judging from our logs, this happens with a certain (low) frequency on our schedds. In many cases, the StartLog shows lines like "ClaimId [...] not found" while, at the same time, the schedd logs "Failed to get address of starter for this job" in /var/log/messages.
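
(For anyone who wants to look for the same pattern: a rough check, assuming the usual /var/log/condor location for the StartLog; the schedd messages end up in syslog on our hosts:)

    # On an affected worker node:
    grep 'not found' /var/log/condor/StartLog | grep ClaimId

    # On the schedd host (our schedd messages go to syslog):
    grep 'Failed to get address of starter' /var/log/messages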

 

Today I encountered a scenario where this was happening repeatedly on one of our schedds. For a couple of hours, whenever I submitted a job to it, the job would soon be marked as Running even though it hadn't actually started (condor_q -run showed no assigned worker node, and the RemoteHost classad was undefined). When I could find the worker node where it was supposed to be running, I would find the complaints above in its logs. The job was later moved to a different worker node, where the same thing happened. Restarting and even rebooting the schedd didn't help. Eventually the schedd seemed to fix itself, and the jobs finally ran on the last worker node they had been scheduled to.
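
(Roughly what I was looking at, with 12345 standing in for the actual cluster id:)

    condor_q -run                    # the host column was empty
    condor_q 12345 -af:j JobStatus RemoteHost
    # printed: 12345.0 2 undefined   (Running, but no RemoteHost)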

 

During this episode, I also found many messages like "failed to connect schedd_xxxx_yyyy as requested by STARTD" in the SharedPortLog of the schedd. E.g.:

 

02/20/26 11:43:18 SharedPortServer: server was busy, failed to connect schedd_3401_61b0 as requested by STARTD <188.185.208.202:9618?addrs=188.185.208.202-9618+[2001-1458-303-10--100-47]-9618&alias=b9p10p7343.cern.ch&noUDP&sock=startd_2536231_16ff> on <188.185.208.202:44263>: primary (<cookie>/schedd_3401_61b0): Connection refused (111); alt (/var/lock/condor/daemon_sock/schedd_3401_61b0): Connection refused (111)

 

But I'm not sure whether this occurs in every case where "Failed to get address of starter for this job" is shown, and I think it also appears in other scenarios, when this problem is not happening.
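
A crude way to see how often the message shows up over time (assuming the default /var/log/condor location for the SharedPortLog; the awk just buckets the timestamps above by hour):

    grep 'server was busy' /var/log/condor/SharedPortLog \
        | awk '{print $1, substr($2,1,2) ":00"}' | sort | uniq -c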

 

It seems that, for some reason, the schedd sometimes has problems communicating with the startds (probably not always affecting all jobs as consistently as in the case I saw today). Has anybody else seen this? Any idea whether it could be caused by some misconfiguration of the schedd (not enough threads/sockets or the like)? What's most puzzling to me is why it suddenly fixed itself.
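
In case someone wants to compare settings: this is what I would dump on the schedd host (SHARED_PORT_MAX_WORKERS is only my guess at a relevant limit, not a confirmed culprit):

    # Show every shared-port-related parameter the schedd sees:
    condor_config_val -dump | grep -i SHARED_PORT

    # My guess at a relevant knob (an assumption, not a confirmed cause):
    # the cap on concurrent shared port worker processes.
    condor_config_val SHARED_PORT_MAX_WORKERS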

 

Cheers,

    Antonio

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/