On Mon, Nov 03, 2014 at 03:51:43PM +0000, Brian Candler wrote:
(related to my previous post)
If I submit a DAG which uses an http:// URL for an input file, and the
file transfer fails, the job goes into a "hold" state. Is it possible to
configure this so that the node fails outright instead?
If the DAG node failed, the whole DAG would fail, and the user would
notice. However, if a job ends up in the 'held' state, it looks just like
a job that is taking forever to run, and it needs additional monitoring
to catch.
I would look into setting "periodic_remove" in your job submit file. You can
condition it on the HoldReasonCode that indicates a failed file transfer, so
that jobs held for other reasons are left alone. I'll defer to Kent Wenger on
this, but I believe that if a job gets removed, it causes the DAG to fail.
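
As a rough sketch of the periodic_remove idea, something like the following
in the submit file might do it. The value 13 is my assumption for the
"input file transfer failed" HoldReasonCode; please check it against the
manual for your HTCondor version, or against the HoldReasonCode of a job
that is actually held for this reason, before relying on it.

```
# Sketch only: remove the job once it is held because input transfer failed.
# JobStatus == 5 means the job is in the held state.
# HoldReasonCode == 13 is ASSUMED to be the input-transfer-failure code;
# verify it for your HTCondor version.
periodic_remove = (JobStatus == 5) && (HoldReasonCode == 13)
```

To see which code a held job really carries, "condor_q -long <job id>" lists
HoldReasonCode together with the HoldReason text.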