Re: [Condor-users] Dagman Newbie Questions
- Date: Fri, 10 Oct 2008 10:33:31 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Dagman Newbie Questions
On Thu, 9 Oct 2008, Jeremy Yabrow wrote:
> It seems that condor_q only shows the DAG jobs that are running or that
> CAN run. Jobs that are blocked because their prerequisite jobs haven't
> finished yet are not shown. Also, if I keep the jobs in the system and
> attempt to re-run them (using condor_hold & condor_release), the
> dependencies are not obeyed during this subsequent "run".
Yes, that's right. Once a job is submitted, DAGMan doesn't do anything to
it (besides removing it if you remove the condor_dagman job itself).
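
For example, while a DAG is running, the condor_dagman job shows up in
the queue as an ordinary Condor job, and removing it takes its node jobs
with it (the cluster ID below is hypothetical):

    # the DAGMan job itself is visible here, alongside any node jobs
    # it has already submitted
    condor_q

    # removing the condor_dagman job also removes the node jobs
    # it submitted
    condor_rm 123.0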
> If I understand what's going on, the DAGMan job appears to be simply a
> job that submits other jobs, and the downstream jobs are not even
> submitted until their prerequisite jobs have run. Subsequent runs can
> only be done with the .rescue file. Is this correct?
Well, the first part of this is basically correct. DAGMan does a little
more than just submit jobs, but yes, jobs whose prerequisites are not
satisfied are not submitted to Condor.
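
For instance, in a minimal DAG file like the sketch below (node names and
submit file names are hypothetical), DAGMan doesn't submit node B to
Condor at all until node A has completed successfully:

    # two-node DAG: B depends on A
    JOB A a.sub
    JOB B b.sub
    # B stays outside the Condor queue until A succeeds
    PARENT A CHILD B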
As far as the rescue DAG goes, though, you only get a rescue DAG if the
workflow fails or if you condor_rm the condor_dagman job. If you do run
a rescue DAG, you don't re-run all of the jobs, only the ones that didn't
finish (or were not run at all) the first time around.
If you want to re-run a DAG from scratch, you need to do

    condor_submit_dag -f <whatever>.dag

This will re-run the whole DAG regardless of whether it succeeded the
first time.
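
So, roughly, the two re-run modes look like this (the DAG file name is
hypothetical, and the exact rescue file name and invocation depend on
your Condor version):

    # re-run only the nodes that failed or never ran, using the rescue
    # DAG written when the previous run failed or was removed
    condor_submit_dag diamond.dag.rescue

    # force a complete re-run of every node, even ones that succeeded
    condor_submit_dag -f diamond.dag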
> This has consequences for us because in our business, deadlines are
> critical and resource utilization must be maximized, so progressive
> estimates of completion and remaining work are necessary. We need all
> nodes of the entire DAG to be present in the system to estimate
> resource use, even though many of the nodes may be blocked waiting for
> prerequisite nodes. Jobs that submit other jobs are a nasty surprise
> for our resource managers. Also, re-running a node and its dependent
> nodes is fairly common and is often done many times during pipeline
> troubleshooting; we don't want to have to re-submit the entire DAG
> several times in separate runs, because there may be long-running
> nodes in the DAG that we want to continue in parallel while we're
> working on other "broken" nodes.
As far as estimates of completion go, you can get some idea by looking at
the dagman.out file, where you'll find periodic updates like this:
7/10 17:20:40  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
7/10 17:20:40   ===   ===      ===    ===     ===        ===      ===
7/10 17:20:40     1     0        0      0       2          1        0
Of course, this doesn't account for differences in resource usage between
different nodes of the DAG.
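
If you want to watch those updates as they arrive, you can follow the
dagman.out file, whose name is derived from the DAG file name (the DAG
file name below is hypothetical, and the exact column layout of the
status lines varies a bit between Condor versions):

    # follow the DAGMan log for diamond.dag as it is written
    tail -f diamond.dag.dagman.out

    # or pull out just the periodic status tables
    grep -A 2 'Done.*Pre.*Queued' diamond.dag.dagman.out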
You can have DAGMan re-try nodes if you want, but that doesn't allow you
to manually re-start failed nodes.
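
Retries are declared per node in the DAG file; for example, a line like
the following (with a hypothetical node name) makes DAGMan automatically
re-run a failed node:

    # re-run node B up to 2 times if it exits with a failure
    RETRY B 2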
If you absolutely have to have all of the jobs in the queue, though, I
don't see any way to do this with DAGMan.
> It looks to me like we'd have a hard time getting Condor/DAGMan to
> support these needs. I'd love any advice / comments you might have on
> this.
Well, having DAGMan submit all of the jobs and then put them on hold if
they're not ready (or something along those lines) would be a really
fundamental change in DAGMan, and I don't see that happening. (And for
many users, the fact that not all of the jobs go into the queue right
away is a benefit, because it decreases the load on the Condor central
manager and the schedd.)
Sorry I haven't been of more help...
Kent Wenger
Condor Team