As Ian Chesal's response indicated, the behavior in Condor 6.8 is to put
jobs on hold when there are errors transferring files. This differs
from earlier versions, which would keep trying to run the job over and
over, often in the vain hope that it was just a transient error. Now
you can construct an automated policy about what should happen to the
held jobs by configuring SYSTEM_PERIODIC_RELEASE or
SYSTEM_PERIODIC_REMOVE (or the user can configure a policy with the
respective job policy expressions). As Ian also pointed out, it would
be nice to have more control over Condor's emailing, to avoid bombarding
users. I have found that, for all practical purposes, any user
operating at large scale simply must set notification = never in their
submit file.
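
As a rough illustration of the kind of policy I mean (a sketch only: the one-day and one-hour thresholds are arbitrary assumptions, and the expressions should be checked against the manual for your Condor version), a held job can be recognized by JobStatus == 5, and EnteredCurrentStatus records when it entered that state:

```
# --- condor_config on the submit machine (illustrative values) ---
# Remove any job that has been sitting on hold for more than a day.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 86400)

# --- a user's submit description file (illustrative values) ---
# Suppress per-job email, and give up on this job after an hour on hold.
notification    = never
periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 3600)
```

A release policy (SYSTEM_PERIODIC_RELEASE or periodic_release) can be written the same way, but it only makes sense when the underlying problem might go away by itself; releasing a job whose output directory has been deleted will just put it back on hold.
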
I should also note that there was one case of file transfer errors not
handled by 6.8's put-on-hold policy: a failure while writing the
output to the submit machine (e.g. because the disk is full). This has
been fixed in Condor 6.9.1, so jobs will go on hold in this case too.
It was difficult to judge whether this was a bug fix or a change in
behavior (i.e. suitable for 6.8 vs. 6.9). In the end, I decided to put
it into 6.9.

--Dan

Dr Ian C. Smith wrote:

Hi,

I have a recurring problem here where our users submit
files through a web interface but then inadvertently
remove the directory the Condor input/output files
are sitting in without killing the job first. I've
tried all sorts of safeguards to prevent this but they
still seem to find a way of doing it (that's users for ya!).

Condor's "try, try and try again" strategy means that
it keeps attempting to write the output files in the hope
that the directory might reappear, deluging my inbox
with error messages in the process.

I can understand that this may have been put in to deal with flaky
NFS filesystems (although I see that Condor tries to avoid
these like the plague now), but is there any way of getting
Condor to just give up if it can't write the output files?
If not, can it be set up not to bombard me with e-mail warnings?

On a related point: if I specify that a particular output file
is to be transferred back from the execute host using
transfer_output_files =
and the file isn't there (usually because the executable
has bombed), it just seems to keep on trying in vain.
Is there any way to prevent this either?

regards,

-ian.
--------------------
Dr Ian C. Smith,
The University of Liverpool,
Computer Services Department
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR