1 - dagman appears to exit for no reason, with no errors in any logs that I can find
2 - after recovering, dagman hangs after all the nodes have been submitted and completed
3 - the delay in dagman recovering is nearly 1 hour

I've attached logs, example dags and subs that will hopefully be enough for someone to understand what is going on. To fill in more of the details:
The main dag is called "master.dag". It consists of anywhere from one to a couple of thousand independent nodes that are themselves dags. A couple of lines from master.dag look like:
JOB eali01nondemo.0 eali01nondemo.0/testcase.dag.condor.sub
JOB eana01non3rdord.0 eana01non3rdord.0/testcase.dag.condor.sub
JOB eana02nontracea.0 eana02nontracea.0/testcase.dag.condor.sub
JOB eana03nonssrayt.0 eana03nonssrayt.0/testcase.dag.condor.sub
...and so on...
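For anyone who wants to reproduce the layout, here is a minimal Python sketch of how JOB lines of this shape could be generated from a list of node names. The function name master_dag_lines is made up for illustration, and the sketch assumes every node directory holds a testcase.dag.condor.sub, as in the lines above; it is not the script actually used to build master.dag.

```python
def master_dag_lines(node_names):
    """Return one JOB line per node, in the same form as the master.dag excerpt."""
    # Each node is itself a dag, submitted through its own .condor.sub file
    # that lives in a directory named after the node.
    return [f"JOB {name} {name}/testcase.dag.condor.sub" for name in node_names]


if __name__ == "__main__":
    # Two node names taken from the excerpt above.
    for line in master_dag_lines(["eali01nondemo.0", "eana01non3rdord.0"]):
        print(line)
```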
The attached example has 160 nodes like this. Each
testcase.dag.condor.sub is another dag, see testcase.dag for an
example, that has a single node with a PRE and a POST script. These
dags, including the node, pre and post scripts, only take about a
minute to complete. The .sub files for the dag (see
testcase.dag.condor.sub) are edited to send their dagman logs to a
common log, orats.dagman.log which is also included. The output from
dagman, master.dag.dagman.out, is included and you will note that the
run starts at 08:03:50 and at 08:10:09 node eenv01nonstackd.0 is
submitted, but output stops until 09:04:22 when a new STARTING UP
header appears and dagman attempts to recover. Upon completing
recovery, the first node submitted is the same eenv01nonstackd.0 as
clusterID 30068.0. Shortly thereafter, the following appears in the log:
09:10:02 ERROR: node eenv01nonstackd.0: job ID in userlog submit event (30069.0) doesn't match ID reported earlier by submit command (30068.0)! Trusting the userlog for now, but this is scary!
In the common log, orats.dagman.log, you can see that the node eenv01nonstackd.0 is started twice, once with cluster ID 30068.0 and once with 30069.0. Both of these are after 9:00.
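To locate silent stretches like the 08:10:09 to 09:04:22 gap in master.dag.dagman.out, a rough Python sketch along these lines may help. It assumes each log line carries an HH:MM:SS timestamp and that the log does not cross midnight; find_gaps and the 10-minute threshold are arbitrary choices for illustration, not anything dagman provides.

```python
import re
from datetime import timedelta

# First HH:MM:SS token on a line (assumed timestamp format).
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (before, after) time-of-day pairs where the log went quiet."""
    gaps = []
    prev = None
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue  # skip lines without a timestamp
        hours, minutes, seconds = map(int, match.groups())
        cur = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev is not None and cur - prev >= threshold:
            gaps.append((prev, cur))
        prev = cur
    return gaps


if __name__ == "__main__":
    # Placeholder lines; only the timestamps come from the run described above.
    sample = ["08:03:50 start", "08:10:09 submit", "09:04:22 restart"]
    for before, after in find_gaps(sample):
        print(f"no output between {before} and {after}")
```

To run it against the real file, pass the open file object, for example find_gaps(open("master.dag.dagman.out")).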
If other logs or info would be useful, I can supply them. Thanks in advance for any help that anyone can offer.

Bob Mortensen
Attachment:
dagman.zip
Description: Zip archive