[Condor-devel] ATTR_JOB_RUN_COUNT
- Date: Tue, 9 Jan 2007 12:02:29 -0600
- From: Peter Keller <psilord@xxxxxxxxxxx>
- Subject: [Condor-devel] ATTR_JOB_RUN_COUNT
Hello,
It turns out that when the schedd runs a job, it increments the
ATTR_JOB_RUN_COUNT when adding the birthday record for a shadow.
I think there are two problems with the current implementation.
1) In the case of a disconnected starter/shadow, the job hadn't actually
died, but the starting of a new shadow increments the ATTR_JOB_RUN_COUNT
erroneously. You can see this behavior in schedd.C:add_shadow_birthdate().
2) If the shadow is started, but then the startd at the very last minute
refuses the job such that the shadow exits with JOB_NOT_STARTED (legal
behavior), then ATTR_JOB_RUN_COUNT is also incremented.
So, it looks like ATTR_JOB_RUN_COUNT is really the count of the times a
shadow has been started on behalf of a job, and not the number of times
the job has actually entered its main().
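
To make the distinction concrete, here is a toy model in Python (not Condor
code). It assumes the counter is bumped once per shadow start, as described
above. The class ToyJob, the field main_entries, and the outcome labels
RAN_MAIN and RECONNECTED are made up for illustration; only JOB_NOT_STARTED
is a real name from this message.

```python
from dataclasses import dataclass

# Hypothetical outcome labels for a single shadow launch.
RAN_MAIN = "RAN_MAIN"                # job really entered its main()
RECONNECTED = "RECONNECTED"          # new shadow reattached to a live job, case (1)
JOB_NOT_STARTED = "JOB_NOT_STARTED"  # startd refused at the last minute, case (2)


@dataclass
class ToyJob:
    job_run_count: int = 0  # stands in for ATTR_JOB_RUN_COUNT
    main_entries: int = 0   # hypothetical: times main() was actually entered

    def start_shadow(self, outcome: str) -> None:
        # Modeled behavior: the count goes up when the shadow is born,
        # before anyone knows whether the job will really run.
        self.job_run_count += 1
        if outcome == RAN_MAIN:
            self.main_entries += 1


job = ToyJob()
for outcome in (JOB_NOT_STARTED, RAN_MAIN, RECONNECTED):
    job.start_shadow(outcome)

print(job.job_run_count, job.main_entries)  # prints "3 1"
```

Three shadow starts, one real execution: the two numbers only agree when
every shadow start leads to a fresh entry into main().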
This came up because someone had PeriodicRemove = JobRunCount > 1 in their
job policy, and it was failing in cases where the job hadn't run at all
(case 2). They would get "submit/execute/abort due to PeriodicRemove"
in their user job logs without the jobs running, since event (2) isn't
recorded in the user log. While inspecting the use of that call,
I noticed (1).
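
A rough trace of how that policy can trip, again as a Python sketch rather
than anything from the schedd. The function periodic_remove and the variable
main_entries are hypothetical, and the sketch assumes the expression is
checked after each shadow start; the real evaluation points may differ.

```python
def periodic_remove(job_run_count: int) -> bool:
    # The user's policy expression: JobRunCount > 1
    return job_run_count > 1


job_run_count = 0  # stands in for ATTR_JOB_RUN_COUNT
main_entries = 0   # hypothetical: times the job really entered main()

# First shadow starts; the startd refuses and the shadow exits
# with JOB_NOT_STARTED. The count was already incremented.
job_run_count += 1
removed_after_first = periodic_remove(job_run_count)

# Second shadow starts; the count is incremented again before
# the job has had a chance to run.
job_run_count += 1
removed_after_second = periodic_remove(job_run_count)

print(removed_after_first, removed_after_second, main_entries)
```

Under these assumptions the policy fires on the second shadow start even
though main_entries is still 0.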
Is there any way to say "Don't run the job more than once, but make sure the
job's main() was actually executed."?
Thank you.
-pete