
Re: [HTCondor-users] Unreliable job start count



Hi Todd,

03/20/25 12:09:50 ** $CondorVersion: 24.6.0 2025-03-05 BuildID: 790852 PackageID: 24.6.0-1 $
06/06/25 16:53:09 ** $CondorVersion: 24.6.0 2025-03-05 BuildID: 790852 PackageID: 24.6.0-1 $
08/27/25 11:06:54 ** $CondorVersion: 24.11.2 2025-08-21 BuildID: 827416 PackageID: 24.11.2-1 GitSHA: a2297d25 $
08/27/25 14:07:20 ** $CondorVersion: 24.11.2 2025-08-21 BuildID: 827416 PackageID: 24.11.2-1 GitSHA: a2297d25 $

This difference tracks with my experience - I didn't notice the weird behavior before September.
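In case it is useful for cross-checking, here is a minimal sketch of how banner lines like the ones above could be scanned for the point where the version changes. It assumes the timestamps are MM/DD/YY HH:MM:SS and that the version is the first token after `$CondorVersion:`. The names `BANNER` and `version_changes` are made up for this example and are not part of HTCondor.

```python
import re
from datetime import datetime

# Assumed banner layout: "MM/DD/YY HH:MM:SS ** $CondorVersion: <version> ..."
BANNER = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \*\* "
    r"\$CondorVersion: (?P<version>\S+) "
)


def version_changes(lines):
    """Return (timestamp, old_version, new_version) for every banner
    whose version differs from the banner before it."""
    changes = []
    previous = None
    for line in lines:
        match = BANNER.match(line)
        if match is None:
            continue  # not a version banner, skip it
        stamp = datetime.strptime(match.group("ts"), "%m/%d/%y %H:%M:%S")
        version = match.group("version")
        if previous is not None and version != previous:
            changes.append((stamp, previous, version))
        previous = version
    return changes


# The four banner lines quoted above, in log order.
sample = [
    "03/20/25 12:09:50 ** $CondorVersion: 24.6.0 2025-03-05 BuildID: 790852 PackageID: 24.6.0-1 $",
    "06/06/25 16:53:09 ** $CondorVersion: 24.6.0 2025-03-05 BuildID: 790852 PackageID: 24.6.0-1 $",
    "08/27/25 11:06:54 ** $CondorVersion: 24.11.2 2025-08-21 BuildID: 827416 PackageID: 24.11.2-1 GitSHA: a2297d25 $",
    "08/27/25 14:07:20 ** $CondorVersion: 24.11.2 2025-08-21 BuildID: 827416 PackageID: 24.11.2-1 GitSHA: a2297d25 $",
]

changes = version_changes(sample)
for stamp, old, new in changes:
    print(f"{stamp:%Y-%m-%d %H:%M:%S}: {old} -> {new}")
```

On these four lines it reports a single change, from 24.6.0 to 24.11.2, at the first restart on 27 August.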

About the jobs: they are changing all the time, so it's entirely possible that this is a contributing factor. Given our very diverse user community, it's impossible to track down exactly what changed in the job population.

I thought it smelled like a race condition; let's leave it at that and hope that the version info above helps you out.  Thanks for thinking along with us.

JT




On 10 Oct 2025, at 18:27, Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

JobRunCount is (should be) updated when the schedd starts a new shadow.  It does _not_ count anything else, specifically including the number of times the job's executable was started.

REFUSED message also comes at that same timestamp - maybe the starter isn't dead enough yet to accept another job?

There is and has been for quite some time a race condition between job termination and the slot becoming available, mostly due to how long it takes the starter to clean up the execute directory.

This error is relatively new; looks like it started in September, see below.

Did you change the installed version of HTCondor around that time, and if so, from which version to which version?  I'm not aware of any changes we've made that would make the race condition worse or more likely, but if you have a specific pair of versions to compare, maybe we can find something.  And then for the other side of the equation: as far as you know, neither the jobs you're running nor the underlying system(s) changed around then, either?

-- ToddM
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/