[Condor-users] Job Attributes and Job Policy Expressions
- Date: Mon, 12 Jul 2010 11:38:49 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: [Condor-users] Job Attributes and Job Policy Expressions
Hi All,

Is anyone aware of anything documenting job attributes, particularly in
relation to what attributes are available at what times? E.g.
JobStartDate obviously won't appear until a job has transitioned from
idle to running. It is possible to use "condor_q -l" to see a job's
attributes, but I was hoping for a listing of ALL possible attributes
and when they are "available".

The reason is that I have been fiddling with some job policy expressions
to "overcome" some issues we have on occasion when submitting jobs, e.g.
some jobs exiting too early and some seeming to run forever. If we
manually resubmit the "too early" jobs then they mostly seem to run OK.
Manually putting the "run forever" jobs on hold and then manually
releasing them also causes them to mostly run OK. This can be a
laborious process with 10,000+ submitted jobs, so we were looking at a
way to make this happen automatically using on_exit_remove,
periodic_hold, etc.

I now have something that seems to work for us, but it was a bit of a
trial and error process, as some of the existing docs/examples don't
seem to work (the attribute doesn't exist, i.e. is not defined), and
even some of the attributes seen with "condor_q -l" give "undefined"
errors. E.g. the docs/example give one like:

    == False) && (ExitSignal != 0)) || (ServerStartTime - JobStartdate < 3600 )

As far as I can tell there is no ServerStartTime. There is, however, a
ServerTime, but even a reference to that says it is undefined, yet I can
see it with condor_q -l.

BTW, this is for the Windows version, 7.2.4.

Our trial and error solution gave us the following, which seems to work
OK for our purposes. This particular test setup is for jobs that should
run for 20 minutes; any less than this, or more than this by 5 mins,
means something dodgy has happened, so we want to try re-running the job.

MINUTE = 60
- JobCurrentStartDate) > (15 * $(MINUTE))
periodic_hold = (CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE))
periodic_release = (CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE))
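
One way the "undefined" problem might be avoided is to guard each
expression so it only looks at an attribute once the job is in a state
where that attribute should exist. The following is a minimal, untested
sketch of that idea, not the settings described above. It assumes that
JobStatus is 2 for a running job and 5 for a held job, and that the
ClassAd "=!=" (is not identical to) operator is available in this
version.

```
# Hypothetical guarded variant of the hold/release expressions (untested).
MINUTE = 60

# Hold only a running job (assumed JobStatus == 2) whose start time is
# defined and which has been running for more than 30 minutes.
# If JobCurrentStartDate is missing, the "=!= UNDEFINED" test is False,
# so the whole expression is False rather than undefined.
periodic_hold = (JobStatus == 2) && (JobCurrentStartDate =!= UNDEFINED) && ((CurrentTime - JobCurrentStartDate) > (30 * $(MINUTE)))

# Release only a held job (assumed JobStatus == 5) that has been in its
# current state for more than 5 minutes.
periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > (5 * $(MINUTE)))
```

As written, periodic_release would also release jobs that were put on
hold for some other reason, so a real setup may need a narrower
condition.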

Thanks for any help.

Cheers,
Greg