
Re: [HTCondor-users] Unreliable job start count



JobRunCount is (should be) updated when the schedd starts a new shadow. It does _not_ count anything else, specifically including the number of times the job's executable was started.
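
To see how far the two counts diverge for a given job, one option is to compare JobRunCount against NumJobStarts in the job ad. This is only a sketch under assumptions: NumJobStarts is assumed here to be the job ad attribute that counts how many times the job actually started executing (it may not be defined in every universe), the helper name is hypothetical, and the ad values below are made up for illustration.

```python
from typing import Mapping, Optional


def shadow_starts_without_exec(ad: Mapping[str, object]) -> Optional[int]:
    """Return JobRunCount minus NumJobStarts for one job ad.

    A positive result suggests shadows were started that never got as far
    as running the executable. Returns None if either attribute is missing.
    """
    run_count = ad.get("JobRunCount")    # bumped when the schedd starts a shadow
    job_starts = ad.get("NumJobStarts")  # assumption: counts actual job starts
    if run_count is None or job_starts is None:
        return None
    return int(run_count) - int(job_starts)


# Illustrative ad only; these numbers are invented, not taken from a real pool.
example_ad = {"ClusterId": 1234, "ProcId": 0, "JobRunCount": 3, "NumJobStarts": 1}
print(shadow_starts_without_exec(example_ad))
```

The mapping can be filled from whatever you already use to dump job ads; something like `condor_q -af JobRunCount NumJobStarts` should print the two values side by side, assuming both attributes are present for your jobs.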

REFUSED message also comes at that same timestamp - maybe the starter isn't dead enough yet to accept another job?

There is and has been for quite some time a race condition between job termination and the slot becoming available, mostly due to how long it takes the starter to clean up the execute directory.

This error is relatively new; looks like it started in September, see below.

Did you change the installed version of HTCondor around that time, and if so, from which version to which version? I'm not aware of any changes we've made that would make the race condition worse or more likely, but if you have a specific pair of versions to compare, maybe we can find something. And then for the other side of the equation: as far as you know, neither the jobs you're running nor the underlying system(s) changed around then, either?

-- ToddM