Re: [HTCondor-users] HTCondor and Docker



On 13/04/2015 16:08, R. Kent Wenger wrote:
>> I already get DAGman to retry each node once. I am thinking about retries which require operator intervention, e.g. because of running out of disk space or a bad NFS mount.
>
> Hmm, sounds like you might want this feature once it's implemented:
>
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4

Well, I've certainly wanted to do that in the past - when I've noticed a node failure while the DAG is running (and the retries failed too), but other jobs are continuing to run happily. Sometimes I've just killed the whole DAG so that I don't have to wait for it to finish.
So being able to kick off a manual retry would be a good feature. 
Another approach I thought of would be for DAGMan to delay its retries 
until the last possible moment - i.e. when there are no other jobs which 
can proceed - instead of retrying as soon as possible. Or perhaps just 
the *last* retry should be handled this way.
Anyway... this is just a tweak. The main issue for me is creating a DAG 
dynamically (in response to a request received in an AMQP message), 
which in turn means a lifecycle of:
* create a working directory
* run the script to create the DAG/submit/input files in this directory
* submit the DAG
* wait for DAG to complete
* send back success/fail message to submitter, and results
* tidy up (i.e. remove the working directory) on DAG success
* on failure, keep all the temp files for post-mortem analysis; after fixes, resubmit the rescue DAG
* management tools: e.g. list the working directories, clusterID for running jobs, exit status for finished jobs (eventually a web interface)
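
A minimal sketch of the lifecycle listed above, in Python, might look something like the following. It is an illustration under stated assumptions, not an existing HTCondor tool: the names run_request, generate, collect and submit_and_wait are hypothetical, condor_submit_dag and condor_wait are assumed to be on PATH, DAGMan's own job log is assumed to be <dagfile>.dagman.log, and the presence of a rescue DAG file is used as a rough failure signal. Receiving the AMQP request and sending the reply are left to the caller.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def submit_and_wait(dag_file):
    """Submit a DAG and block until DAGMan exits; True means success.

    Assumptions: condor_submit_dag and condor_wait are on PATH, and
    DAGMan's own job log is written to <dag_file>.dagman.log.
    """
    cwd = dag_file.parent
    subprocess.run(["condor_submit_dag", dag_file.name], cwd=cwd, check=True)
    subprocess.run(["condor_wait", dag_file.name + ".dagman.log"], cwd=cwd, check=True)
    # Assumption: a rescue DAG file means the DAG did not finish cleanly.
    return not list(cwd.glob(dag_file.name + ".rescue*"))


def run_request(base_dir, generate, collect, run_dag=submit_and_wait):
    """Run one request through the create/submit/wait/tidy-up lifecycle.

    generate(work_dir) writes the DAG/submit/input files and returns the
    path of the DAG file; collect(work_dir) gathers the results to send
    back. Both are hypothetical callables supplied by the caller.
    """
    # Create a fresh working directory for this request.
    work_dir = Path(tempfile.mkdtemp(prefix="dag-", dir=base_dir))
    dag_file = generate(work_dir)
    ok = run_dag(dag_file)
    results = None
    if ok:
        # Gather results first, then tidy up.
        results = collect(work_dir)
        shutil.rmtree(work_dir)
    # On failure the directory is kept for post-mortem analysis and for
    # resubmitting the rescue DAG after fixes.
    return ok, work_dir, results
```

The caller would turn the returned tuple into the success/fail message. Directories left behind under base_dir are then exactly the failed runs, which gives a crude starting point for the "list the working directories" tooling.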
I was initially surprised that HTCondor doesn't come with any tooling 
for that sort of lifecycle - it seems the assumption is that all 
workflows are set up by hand at the CLI.
I did look for HTCondor front-ends; e.g. I found Pegasus, but as far as 
I can see, you are still required to create your own working directory 
to stick the DAX files into, and to keep track of your submission.
Regards,

Brian.