Re: [HTCondor-users] HTCondor and Docker
- Date: Mon, 13 Apr 2015 17:34:13 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor and Docker
On 13/04/2015 16:08, R. Kent Wenger wrote:
>> I already get DAGman to retry each node once. I am thinking about
>> retries which require operator intervention, e.g. because of running
>> out of disk space or a bad NFS mount.
> Hmm, sounds like you might want this feature once it's implemented:
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4
Well, I've certainly wanted to do that in the past - when I've noticed a
node failure while the DAG is running (and the retries failed too), but
other jobs are continuing to run happily. Sometimes I've just killed the
whole DAG so that I don't have to wait for it to finish.
So being able to kick off a manual retry would be a good feature.
Another approach I thought of would be for DAGMan to delay its retries
until the last possible moment - i.e. when there are no other jobs which
can proceed - instead of retrying as soon as possible. Or perhaps just
the *last* retry should be handled this way.
Anyway... this is just a tweak. The main issue for me is creating a DAG
dynamically (in response to a request received in an AMQP message),
which in turn means a lifecycle of:
* create a working directory
* run the script to create the DAG/submit/input files in this directory
* submit the DAG
* wait for DAG to complete
* send back success/fail message to submitter, and results
* tidy up (i.e. remove the working directory) on DAG success
* on failure, keep all the temp files for post-mortem analysis; after
  fixes, resubmit the rescue DAG
* management tools: e.g. list the working directories, clusterID for
  running jobs, exit status for finished jobs (eventually a web interface)
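To make the lifecycle above concrete, here is a rough Python sketch of the create / generate / submit / wait / tidy-up steps. It is only an illustration under several assumptions: the function name run_request, the generate callback and the file name workflow.dag are all made up for this example; it assumes condor_submit_dag and condor_wait are on the PATH, that the DAGMan job's own log is written next to the DAG as <dagfile>.dagman.log, and that the presence of a <dagfile>.rescueNNN file means the DAG failed. Sending the success/fail message back over AMQP is left to the caller.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_request(generate, run=subprocess.run, base_dir=None):
    """Run one request through the lifecycle; return (ok, workdir).

    generate(workdir, dag_path) is a hypothetical callback that writes the
    DAG, submit and input files. run is injectable so the sketch can be
    exercised without an HTCondor installation.
    """
    # Create a private working directory for this request.
    workdir = Path(tempfile.mkdtemp(prefix="dag-", dir=base_dir))
    dag = workdir / "workflow.dag"  # hypothetical file name

    # Let the caller's script create the DAG/submit/input files.
    generate(workdir, dag)

    # Submit the DAG, then block until the DAGMan job itself has finished.
    # Assumption: DAGMan's own log is <dagfile>.dagman.log in the same dir.
    run(["condor_submit_dag", dag.name], cwd=workdir, check=True)
    run(["condor_wait", dag.name + ".dagman.log"], cwd=workdir, check=True)

    # Assumption: a rescue DAG file means at least one node failed.
    failed = any(workdir.glob(dag.name + ".rescue*"))
    if failed:
        # Keep all the files for post-mortem analysis and later resubmission.
        return False, workdir

    # Tidy up on success.
    shutil.rmtree(workdir)
    return True, workdir
```

On failure the directory is left in place, so after fixing the underlying problem the same DAG file can be submitted again from that directory to pick up the rescue DAG. Listing the kept directories under base_dir would be a crude starting point for the management tooling.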
I was initially surprised that HTCondor doesn't come with any tooling
for that sort of lifecycle - it seems the assumption is that all
workflows are set up by hand at the CLI.
I did look for HTCondor front-ends; e.g. I found Pegasus, but as far as
I can see, you are still required to create your own working directory
to stick the DAX files into, and to keep track of your submission.
Regards,
Brian.