
Re: [HTCondor-users] proposed change in DAGMan



On 15/06/2016 20:56, John N Calley wrote:
> I think this is an excellent option. I think it would be best for it to be on by default because I think it is most useful for naïve users.
I think the fundamental problem is that there doesn't seem to be much 
consistency over which error conditions cause a job to go on hold, and 
which cause it to fail.
For example, a DAG node whose job tries to fetch a remote file via HTTP, 
where the file does not exist (a 404 error), goes on hold. If the user 
notices that there is no progress and queries the job, they find:
~~~
4299200.000:  Request is held.

Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.6.213
failed to receive file /var/lib/condor/execute/dir_9329/nonexistent:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
~~~

However, the user may not notice that the job has gone on hold, and I believe it won't be retried automatically.
So if the aim is to help naïve users, it may be better to treat all such 
errors as job failures. In fact, I can't think of any case where I would 
prefer the job to go on hold rather than immediately fail the DAG node.
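For what it's worth, you can approximate "fail instead of hold" today with a periodic_remove expression in the node's submit file: DAGMan counts a removed job as a node failure, which its RETRY keyword can then retry. A rough sketch (the node name is illustrative, and I haven't tested this exact combination):

~~~
# In the node's submit file: remove the job if it ever goes on hold
# (JobStatus == 5 means Held), so DAGMan sees a failed node instead
# of a silently held job.
periodic_remove = (JobStatus == 5)

# In the DAG file: let DAGMan retry the node a few times.
# RETRY <node name> <count>
RETRY fetch_node 3
~~~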
In the case of file transfers, a more sophisticated option might be to 
distinguish between temporary errors (e.g. timeout talking to remote 
HTTP server, or 5xx errors) and permanent errors (4xx errors). The 
former could go on hold and retry automatically a few times before 
giving up, while the latter would fail immediately. However, that's a 
much more substantial change, and it doesn't solve the more general 
problem of jobs going on hold for other unexpected reasons.
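To make the distinction concrete, here's a minimal sketch of the transient/permanent split in plain Python. This is not HTCondor's actual curl_plugin (whose internals I haven't looked at); the function names are mine:

```python
# Illustrative sketch: classify an HTTP fetch failure as transient
# (worth retrying) or permanent (fail the node immediately).
import urllib.request
import urllib.error

def classify_status(code):
    """4xx client errors are permanent; 5xx server errors are
    transient. Anything else is treated as transient, since
    timeouts and connection resets carry no status code."""
    if 400 <= code < 500:
        return "permanent"
    return "transient"

def fetch(url, retries=3):
    """Fetch a URL, retrying transient failures a few times but
    giving up immediately on permanent ones."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if classify_status(e.code) == "permanent":
                raise        # 4xx: no point retrying
        except urllib.error.URLError:
            pass             # network-level error: retry
    raise RuntimeError("transient failure persisted after retries")
```

A real plugin would presumably also want exponential backoff between attempts, but the classification logic is the interesting part.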
Of course, failing a single DAG node doesn't prevent progress from being 
made in the rest of the DAG, but you can't retry a node until the whole 
DAG run has completed. I seem to remember discussion around a proposed 
feature to signal DAGMan to retry failed nodes before the current run is 
complete. That would be a very useful feature, and I think it would 
probably cover the use cases for jobs 'on hold' better than today: it 
would work for any node error which you could repair.
Regards,

Brian.