Re: [Condor-users] Dagman Newbie Questions
- Date: Fri, 10 Oct 2008 10:33:31 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Dagman Newbie Questions
On Thu, 9 Oct 2008, Jeremy Yabrow wrote:
> It seems that condor_q only shows the DAG jobs that are running or that
> CAN run.  Jobs that are blocked because their prerequisite jobs have not
> finished yet are not shown.  Also, if I keep the jobs in the system and
> attempt to re-run them (using condor_hold & condor_release), the
> dependencies are not obeyed during this subsequent "run".
Yes, that's right.  Once a job is submitted, DAGMan doesn't do anything to 
it (besides removing it if you remove the condor_dagman job itself).
> If I understand what's going on, the dagman job appears to be simply a
> job that submits other jobs, and the downstream jobs are not even
> submitted until their prerequisite jobs have run.  Subsequent runs can
> only be done with the .rescue file.  Is this correct?
Well, the first part of this is basically correct.  DAGMan does a little
more than just submit jobs, but yes, jobs whose prerequisites are not
satisfied are not submitted to Condor.
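For a concrete picture, here's a minimal DAG file (the node and submit-file
names are just hypothetical examples).  With this DAG, nodes B and C are not
submitted to Condor until node A has completed successfully:
    # diamond.dag -- minimal sketch with hypothetical names
    JOB A a.submit
    JOB B b.submit
    JOB C c.submit
    PARENT A CHILD B C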
As far as the rescue DAG goes, though, you only get a rescue DAG if the 
workflow fails or if you condor_rm the condor_dagman job.  If you do run
a rescue DAG, you don't re-run all of the jobs, only the ones that didn't
finish (or were not run at all) the first time around.
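For example, to run a rescue DAG (assuming the rescue file sits next to the
original DAG with a .rescue suffix, as you mention; the exact naming can
vary between Condor versions):
    condor_submit_dag <whatever>.dag.rescue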
If you want to re-run a DAG from scratch, you need to do
    condor_submit_dag -f <whatever>.dag
This will re-run the whole DAG regardless of whether it succeeded the 
first time.
> This has consequences for us because in our business, deadlines are
> critical and resource utilization must be maximized, so progressive
> estimates of completion and remaining work are necessary.  We need all
> nodes of the entire DAG to be present in the system to estimate resource
> use, even though many of the nodes may be blocked waiting for
> prerequisite nodes.  Jobs that submit other jobs are a nasty surprise
> for our resource managers.  Also, re-running a node and its dependent
> nodes is fairly common and is often done many times during pipeline
> troubleshooting; we don't want to have to re-submit the entire DAG
> several times in separate runs, because there may be long-running nodes
> in the DAG that we want to continue in parallel while we're working on
> other "broken" nodes.
As far as estimates of completion go, you can get some idea by looking at 
the dagman.out file, where you'll find periodic updates like this:
7/10 17:20:40  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
7/10 17:20:40   ===     ===      ===     ===     ===        ===      ===
7/10 17:20:40     1       0        0       0       2          1        0
Of course, this doesn't account for differences in resource usage between 
different nodes of the DAG.
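If you want to script around those updates, here's a rough sketch that
pulls the most recent status summary out of the log (this assumes GNU grep
and the default log name of <whatever>.dag.dagman.out):
    # print the latest node-status summary (header, separator, counts)
    grep -A 2 'Done.*Pre.*Queued' <whatever>.dag.dagman.out | tail -3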
You can have DAGMan re-try nodes if you want, but that doesn't allow you 
to manually re-start failed nodes.
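Retries are configured per node in the DAG file itself; for example (node
name hypothetical), this tells DAGMan to re-run node A up to 3 times before
declaring it failed:
    RETRY A 3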
If you absolutely have to have all of the jobs in the queue, though, I 
don't see any way to do this with DAGMan.
> It looks to me like we'd have a hard time getting Condor/Dagman to
> support these needs.  I'd love any advice / comments you might have on
> this.
Well, having DAGMan submit all of the jobs and then put them on hold if 
they're not ready (or something along those lines) would be a really
fundamental change in DAGMan, and I don't see that happening.  (And for
many users, the fact that not all of the jobs go into the queue right
away is a benefit, because it decreases the load on the Condor central 
manager and the schedd.)
Sorry I haven't been of more help...
Kent Wenger
Condor Team