HTCondor Project List Archives




[Condor-devel] ATTR_JOB_RUN_COUNT



Hello,

It turns out that when the schedd runs a job, it increments the 
ATTR_JOB_RUN_COUNT when adding the birthday record for a shadow.

I think there are two problems with the current implementation.

1) In the case of a disconnected starter/shadow, the job hasn't actually
died, but starting a new shadow still increments ATTR_JOB_RUN_COUNT
erroneously. You can see this behavior in schedd.C:add_shadow_birthdate().

2) If the shadow is started, but the startd refuses the job at the very
last minute so that the shadow exits with JOB_NOT_STARTED (legal
behavior), then ATTR_JOB_RUN_COUNT is also incremented.

So, it looks like ATTR_JOB_RUN_COUNT is really the count of the times a
shadow has been started on behalf of a job, and not the number of times
the job has actually entered its main().
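
To make the distinction concrete, here is a toy Python model of the two
cases above. It is not HTCondor code: apart from JOB_NOT_STARTED, every
name in it (the outcome names, the JobCounters class, and the second
"executions" counter) is made up for illustration, and it only assumes
the behavior described in (1) and (2).

```python
from dataclasses import dataclass
from enum import Enum, auto


class ShadowOutcome(Enum):
    """Hypothetical outcomes of one shadow launch (illustrative names)."""
    JOB_EXECUTED = auto()     # the job really entered its main()
    JOB_NOT_STARTED = auto()  # the startd refused the job at the last minute
    RECONNECTED = auto()      # new shadow reattached to a still-running job


@dataclass
class JobCounters:
    shadow_starts: int = 0  # what ATTR_JOB_RUN_COUNT appears to track
    executions: int = 0     # hypothetical: times main() was actually entered


def record_shadow(counters: JobCounters, outcome: ShadowOutcome) -> None:
    # Bumped at every shadow birth, whatever happens afterwards.
    counters.shadow_starts += 1
    # Bumped only when a new execution of the job actually began.
    if outcome is ShadowOutcome.JOB_EXECUTED:
        counters.executions += 1


# Case (2): the startd refuses the job once, then the job runs.
refused_then_ran = JobCounters()
for outcome in (ShadowOutcome.JOB_NOT_STARTED, ShadowOutcome.JOB_EXECUTED):
    record_shadow(refused_then_ran, outcome)

# Case (1): the job runs, then a second shadow reconnects to the same run.
ran_then_reconnected = JobCounters()
for outcome in (ShadowOutcome.JOB_EXECUTED, ShadowOutcome.RECONNECTED):
    record_shadow(ran_then_reconnected, outcome)

print(refused_then_ran)
print(ran_then_reconnected)
```

In both scenarios the shadow-start counter reaches 2 while the job has
entered main() only once, so a policy written against the first counter
sees a "second run" that never happened.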

This came up because someone had PeriodicRemove = JobRunCount > 1 in their
job policy, and it was failing in cases where the job hadn't run at all
(case 2). They would get "submit/execute/abort due to PeriodicRemove"
in their user job logs without the jobs running, since event (2) isn't
recorded in the user log. While inspecting the use of that call,
I noticed (1).

Is there any way to say "Don't run the job more than once, but make sure
the job's main() was actually executed"?

Thank you.

-pete