[HTCondor-users] Getting DAG node to fail on file transfer error
- Date: Mon, 03 Nov 2014 15:51:43 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] Getting DAG node to fail on file transfer error
(related to my previous post)
If I submit a DAG which uses an http:// URL for an input file, and the
file transfer fails, the job goes into a "hold" state. Is it possible to
configure this so that the node fails outright instead?
If the DAG node failed, the whole DAG would fail, and the user would
notice. However, if a job ends up in the 'held' state, it looks just like
a job that is taking forever to run, and catching it needs additional
monitoring.
Example:
$ condor_q -analyze 181.0
-- Submitter: test.example.net : <10.0.2.15:60831> : test.example.net
---
181.000: Request is held.
Hold reason: Error from test.example.net: STARTER at 192.168.56.15
failed to receive file /var/lib/condor/execute/dir_17283/xxxx.xxxx:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
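To illustrate the kind of additional monitoring mentioned above, here is a minimal sketch that scans condor_q -analyze text for a held job and pulls out the hold reason. It is not an HTCondor feature: parse_held and is_transfer_failure are hypothetical helpers, and the sketch assumes the output layout is exactly as in the example above, which may differ between versions.

```python
import re

# Sample text copied from the condor_q -analyze output above.
SAMPLE = """\
181.000: Request is held.
Hold reason: Error from test.example.net: STARTER at 192.168.56.15
failed to receive file /var/lib/condor/execute/dir_17283/xxxx.xxxx:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
"""


def parse_held(text):
    """Return (job_id, hold_reason) if the text reports a held job, else None.

    Hypothetical helper: it assumes the layout shown in the example.
    """
    held = re.search(r"^(\d+\.\d+):\s+Request is held\.", text, re.MULTILINE)
    if held is None:
        return None
    # Take everything after "Hold reason:" to the end of the text.
    reason = re.search(r"^Hold reason:\s*(.*)", text, re.MULTILINE | re.DOTALL)
    # Re-join the wrapped lines of the hold reason into a single line.
    reason_text = " ".join(reason.group(1).split()) if reason else ""
    return held.group(1), reason_text


def is_transfer_failure(reason):
    """Heuristic only: the hold reason in the example contains 'FILETRANSFER'."""
    return "FILETRANSFER" in reason


if __name__ == "__main__":
    result = parse_held(SAMPLE)
    if result is not None:
        job_id, reason = result
        kind = "file transfer failure" if is_transfer_failure(reason) else "other"
        print(f"job {job_id} is held ({kind}): {reason}")
```

In practice the text would come from running condor_q -analyze for the job in question (for instance via Python's subprocess module) rather than from a hard-coded string. This only detects the held state; it does not make the DAG node fail.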
I've checked
http://research.cs.wisc.edu/htcondor/manual/current/condor_submit_dag.html
http://research.cs.wisc.edu/htcondor/manual/current/2_10DAGMan_Applications.html
and I can't see how to fail a held node (or fail a node where the file
transfer fails), although I can see that the node_status_file and
jobstate_log do indicate events for held nodes.
Thanks,
Brian Candler.