Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Unreliable job start count

Date: Fri, 10 Oct 2025 11:27:32 -0500 (CDT)
From: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Unreliable job start count

JobRunCount is (should be) updated when the schedd starts a new shadow. It does _not_ count anything else, specifically including the number of times the job's executable was started.
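To illustrate the distinction, here is a minimal sketch that flags jobs whose shadow-start count differs from their executable-start count. It makes several assumptions: job ads are represented as plain Python dicts with made-up sample values rather than being fetched from a schedd, the helper name is hypothetical, and NumJobStarts is an attribute not mentioned above which, as far as I know, the manual describes as counting the times the job started executing.

```python
# Sketch only: job ads are plain dicts with made-up values. With the real
# Python bindings they would come from a schedd query instead.

def shadow_vs_executable_starts(ads):
    """Return (job id, JobRunCount, NumJobStarts) for jobs whose counts differ."""
    mismatches = []
    for ad in ads:
        shadow_starts = ad.get("JobRunCount", 0)
        # NumJobStarts is an assumption here; it is not named in this thread.
        exec_starts = ad.get("NumJobStarts", 0)
        if shadow_starts != exec_starts:
            job_id = f"{ad['ClusterId']}.{ad['ProcId']}"
            mismatches.append((job_id, shadow_starts, exec_starts))
    return mismatches


# Hypothetical sample ads, for illustration only.
sample_ads = [
    {"ClusterId": 101, "ProcId": 0, "JobRunCount": 1, "NumJobStarts": 1},
    {"ClusterId": 101, "ProcId": 1, "JobRunCount": 3, "NumJobStarts": 1},
]

for job_id, shadows, starts in shadow_vs_executable_starts(sample_ads):
    print(f"{job_id}: {shadows} shadow starts, {starts} executable starts")
```

A job that shows up here had more shadows started than executable starts, which would be consistent with start attempts that never got as far as running the executable.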

REFUSED message also comes at that same timestamp - maybe the starter isn't dead enough yet to accept another job?

There is and has been for quite some time a race condition between job termination and the slot becoming available, mostly due to how long it takes the starter to clean up the execute directory.

This error is relatively new; looks like it started in September, see below.

Did you change the installed version of HTCondor around that time, and if so, from which version to which version? I'm not aware of any changes we've made that would make the race condition worse or more likely, but if you have a specific pair of versions to compare, maybe we can find something. And then for the other side of the equation: as far as you know, neither the jobs you're running nor the underlying system(s) changed around then, either?


-- ToddM
