Re: [Condor-users] how to kill job when output dir removed ?
- Date: Wed, 17 Jan 2007 11:03:19 -0500
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: Re: [Condor-users] how to kill job when output dir removed ?
> Now you can construct an automated policy about what should happen to
> the held jobs by configuring SYSTEM_PERIODIC_RELEASE or
> SYSTEM_PERIODIC_REMOVE (or the user can configure a policy with the
> respective job policy expressions).
Ah yes! Good point. We don't use a try-again policy at Altera, but you
can definitely couple the release and remove config settings with the
retry attempt counter in the jobs to limit the number of times Condor
tries to recover from a missing result directory error, just in case
it's transient. Personally I find stupid users to be rather stable in
their continued existence. :)
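
To make the release/remove coupling concrete, here is a minimal Python
sketch of the decision logic only. It is not Condor configuration syntax,
and the names policy_action, hold_count and MAX_RETRIES are hypothetical:
they stand in for whatever counter attribute and limit a site actually
uses in its release and remove expressions.

```python
# Hypothetical limit on how many times a held job is released for another try.
MAX_RETRIES = 3


def policy_action(is_held: bool, hold_count: int,
                  max_retries: int = MAX_RETRIES) -> str:
    """Decide what an automated policy would do with one job.

    hold_count is assumed to be the number of times the job has gone on
    hold so far (1 the first time it is held).
    Returns "release", "remove", or "none".
    """
    if not is_held:
        # Running or idle jobs are left alone.
        return "none"
    if hold_count <= max_retries:
        # The error might be transient, so give the job another attempt.
        return "release"
    # The job keeps failing the same way; stop retrying.
    return "remove"
```

With the default limit, a job is released after its first, second and
third hold and removed once it is held a fourth time.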
> I should also note that there was one case of file transfer errors not
> handled by 6.8's put-on-hold policy. It is failure while writing the
> output to the submit machine (e.g. because the disk is full). This has
> been fixed in Condor 6.9.1, so jobs will go on hold in this case too.
> It was difficult to judge whether this was a bug fix or a change in
> behavior (i.e. suitable for 6.8 vs. 6.9). In the end, I decided to put
> it into 6.9.
If you're accepting lobbying for pushing this into 6.8, consider this my
request. I'd definitely like to see this as a put-on-hold situation.
Right now we're actually monitoring users' disk usage on our NAS box and
holding their jobs when they get within 1% of their hard quota.
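
For anyone wanting to do something similar, the quota test itself is
simple. The sketch below is only an illustration, not our actual monitoring
script; the function name near_quota and its parameters are made up for the
example, and how you obtain the usage numbers and put jobs on hold is left
out.

```python
def near_quota(used_bytes: int, hard_quota_bytes: int,
               margin: float = 0.01) -> bool:
    """Return True when usage is within `margin` of the hard quota.

    margin=0.01 means "within 1% of the hard quota". All names here are
    illustrative assumptions, not part of any Condor or NAS interface.
    """
    if hard_quota_bytes <= 0:
        raise ValueError("hard_quota_bytes must be positive")
    remaining = hard_quota_bytes - used_bytes
    # Hold once the remaining headroom shrinks to margin * quota or less.
    return remaining <= hard_quota_bytes * margin
```

A monitoring loop would call this per user and hold that user's jobs
whenever it returns True.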
- Ian
P.S. Did something change on the mailing list? I'm not seeing my own
emails to the list anymore.