[HTCondor-users] Failing Jobs on hold due to output missing
- Date: Wed, 27 Mar 2013 15:39:24 +0100
- From: Max Fischer <mfischer@xxxxxxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Failing Jobs on hold due to output missing
Hi,
I have a problem with how HTCondor's transfer_output_files interacts
with the hold mechanism when jobs fail. Our jobs often produce very big
temporary files, so explicitly listing the actual output files is a must
for us in these cases. However, when a job fails before creating an
output file, the STARTER then fails to transfer the "requested" output
file and, as a result, the job is put on hold. [1] I realize this is
documented behavior. [2]
As we have a multi-backend job manager wrapped around Condor, this is
not exactly optimal for us: it masks what is an error in the job itself
as an error of Condor. We could catch the error, but this requires
separate handling for jobs with transfer_output_files and those without
it. It would be much easier if we could have Condor treat the job as
Completed regardless of whether all files exist, or if we could mark
individual files as optional. Is there any way to do this?
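To make the workaround concrete: catching the error on our side would
mean wrapping the payload roughly as below. This is only a sketch under
my own assumptions; run_with_placeholders and its arguments are names
made up for illustration, not anything HTCondor provides. It runs the
job, then creates an empty placeholder for every expected output file
that is missing, so the transfer should have something to send and the
job's own exit code is what gets reported.

```python
import pathlib
import subprocess


def run_with_placeholders(command, expected_outputs, workdir="."):
    """Run the payload, then create empty placeholders for missing outputs.

    Hypothetical helper: `command` is the payload's argv list and
    `expected_outputs` mirrors the names listed in transfer_output_files.
    Returns the payload's exit code and the list of files that were missing.
    """
    result = subprocess.run(command, cwd=workdir)

    missing = []
    for name in expected_outputs:
        path = pathlib.Path(workdir) / name
        if not path.exists():
            # Create an empty stand-in so the file transfer does not fail.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
            missing.append(name)

    # The caller should exit with this code so the job's failure stays visible.
    return result.returncode, missing
```

The wrapper would then exit with the returned code, and our job manager
would have to treat a zero-size file as "no output". This is exactly the
kind of separate handling we would like to avoid.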
Cheers,
Max
[1]
8928.89 mfischer 3/27 14:49 Error from [WORKERNODE]: STARTER at
[WORKERNODE] failed to send file(s) to <[SCHEDD]:9615>: error reading
from
/home/cmsusr189/home_cream_084391945/glide_p21124/execute/dir_20515/cmssw.log.gz:
(errno 2) No such file or directory; SHADOW failed to receive file(s)
from <[WORKERNODE]:52719>
[2]
http://research.cs.wisc.edu/htcondor/manual/current/2_5Submitting_Job.html#SECTION00354200000000000000