This morning, we saw a dagman process in our cluster that was stuck in an
"X" state (via condor_q -dag) after it had been removed (condor_rm <dag
cluster id>). A few of the nodes from the DAG were still running or idle. We
weren't sure why these nodes were still executing after the dagman was
removed, but we think it has something to do with the FINAL node and the
way the dagman parses the DAG. Since these jobs take a very long time to
complete, this ends up causing us issues with slots being held by jobs that
aren't actually supposed to be running. After we cleaned up these extra
processes, we were able to reproduce this with a simple job.
We created a simple .bat script that sleeps for 30 minutes (arbitrary time), which
each node in the dag will use as the executable in the submit file. Our test
DAG had 20 nodes total (also arbitrary), 10 of which were children of the
first 10. The most important part is the FINAL statement at the end of the
DAG. This just sleeps for 15 seconds (also arbitrary, but short enough that
we don't have to wait a long time to watch it finish). When we submit the
DAG, we see a dagman enter a running state, then begin submitting the nodes
of the DAG. If we remove the dagman early enough (we weren't able to pin
down the exact cutoff, but it is likely less than 5-10 seconds after the
first nodes are submitted), the dagman exits without running the workflow,
and we don't see any processes remaining from the DAG. If we wait a bit
longer (maybe more than 10 seconds), we can do a condor_rm to put the dagman
into an "X" state. This is where the problem occurs. We expect any running
and/or idle nodes to be removed, and we expect the FINAL node to be queued,
execute, and return, ending the rest of the workflow. What actually happens
is that the FINAL node is queued, and any running/idle jobs that were in the
queue when dagman was removed *also* remain in the queue. I'm not certain if this is
expected behavior, but we didn't anticipate it when removing a dagman; we
assumed every node would be removed. We also noticed that these processes
continue running after the FINAL node exits.
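For anyone who wants to try this, a DAG of the shape described above could be
generated with a short Python script along these lines. This is only a
minimal sketch, not the exact test files: the submit file names
(sleep_node.sub, final_node.sub), the node names, and the one-to-one
parent/child pairing are all assumptions made for illustration. Only the
JOB, PARENT ... CHILD, and FINAL keywords are standard DAG file syntax.

```python
def build_test_dag(n_parents=10,
                   node_submit="sleep_node.sub",    # hypothetical submit file (long sleep)
                   final_submit="final_node.sub"):  # hypothetical submit file (short sleep)
    """Return DAG file text with n_parents parents, n_parents children, and a FINAL node."""
    lines = []
    # Declare every node; all regular nodes share the same long-running submit file.
    for i in range(n_parents):
        lines.append(f"JOB parent{i} {node_submit}")
        lines.append(f"JOB child{i} {node_submit}")
    # Assumed dependency shape: each child depends on exactly one parent.
    for i in range(n_parents):
        lines.append(f"PARENT parent{i} CHILD child{i}")
    # The FINAL node is declared last, as in the test described above.
    lines.append(f"FINAL cleanup {final_submit}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the DAG text; redirect to a file to use it with condor_submit_dag.
    print(build_test_dag(), end="")
```

With the default arguments this produces 20 JOB lines, 10 PARENT lines, and
one FINAL line. Dropping the FINAL line gives the variant in which we could
not reproduce the problem.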
We think this has something to do with the FINAL node, since we can't
reproduce the issue without it. Also, since we don't see it if we kill the
dagman early enough in the workflow, we think that the FINAL node might not
be evaluated right away; maybe the DAG is still being parsed when the first
set of jobs is queued?