
[Condor-users] Condor DAG feature request



Hello,

I have a feature request for Condor's DAG system, regarding how it handles nested DAGs:
Suppose I have DAG A that calls many DAG B's, and each DAG B runs three 
programs in it, in the order "alpha, beta, gamma".  When gamma fails, 
this causes DAG B to end and generate its own rescue file.  DAG B will 
then tell DAG A about its failure, and DAG A will then generate its own 
rescue file, and the job will stop.
I've noticed that in the case of nested DAGs, DAG A's rescue DAG does 
not point to DAG B's *rescue* file. Instead, it points to DAG B's 
*submit* file, so every instance of alpha, beta, and gamma is run 
again, instead of just the gamma that failed.
I have a system where the "beta" stage of a job is very time-consuming, 
and it is possible that a few "gamma" instances may fail.  It would be 
nice if DAGMan had the ability to detect whether it was running another 
DAG as a sub-job, or just a regular job.  In the case of the former, it 
could intelligently point its own rescue file to the rescue file created 
by the DAG sub-job.
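
As a rough illustration of the requested behavior (not how DAGMan works 
today), the Python sketch below rewrites the JOB lines of an outer rescue 
DAG so that sub-DAG nodes point at their own rescue DAG when one exists. 
The function name and the file naming are assumptions made for the example: 
a sub-DAG node is taken to be one whose submit file ends in ".condor.sub", 
its rescue file is taken to be "<dagfile>.rescue", and nodes already marked 
DONE are left alone.

```python
import os

# Assumed naming conventions; adjust to match the local Condor version.
SUB_SUFFIX = ".condor.sub"   # submit file generated for a DAG
RESCUE_SUFFIX = ".rescue"    # rescue DAG written next to the DAG file


def retarget_to_rescue(dag_lines, rescue_exists=os.path.exists):
    """Point sub-DAG JOB nodes at their rescue DAG where one exists.

    dag_lines: lines of the outer (rescue) DAG file, without newlines.
    rescue_exists: callable that reports whether a rescue file is present.
    """
    out = []
    for line in dag_lines:
        parts = line.split()
        # A node line looks like: JOB <node-name> <submit-file> [DONE]
        is_job = len(parts) >= 3 and parts[0].upper() == "JOB"
        if is_job and parts[2].endswith(SUB_SUFFIX) and "DONE" not in parts[3:]:
            dag_file = parts[2][:-len(SUB_SUFFIX)]
            rescue_file = dag_file + RESCUE_SUFFIX
            if rescue_exists(rescue_file):
                # Hypothetical target: the submit file for the rescue DAG.
                parts[2] = rescue_file + SUB_SUFFIX
                line = " ".join(parts)
        out.append(line)
    return out
```

The submit file for each sub-DAG rescue file would still have to be 
generated before resubmitting DAG A, for example with 
"condor_submit_dag -no_submit" on the rescue file. That step is left out 
of the sketch.
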
Thanks,

 - Armen

--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796