Dear experts,
we have hit a few cases of an odd situation (odd to me, at least) with Dagman status.
Could you kindly help me interpret the information there correctly? Or is there by any chance an inconsistency in the way the statuses are written or documented?
There are more details in https://github.com/dmwm/CRABServer/issues/9098 and more can be provided upon request, like full log files or access to the AP where things ran.
Summary:
- This is not the norm; usually things work fine (thanks!).
- We had a bad day where, for unknown reasons, a lot of POST scripts failed immediately, leading to nodes being retried (as specified in the dagman description file, all fine there). Due to the odd way in which POST failed, the PRE script of each retry failed too. Once the maximum number of retries for a node was reached, the node was marked as failed, but it is still listed as being in PRERUN (!?).
After all affected nodes ended like that, the dagman debug file has (trimmed):
05/27/25 01:10:25 Aborting DAG... [...]
05/27/25 01:10:26 DAG status: 2 (DAG_STATUS_NODE_FAILED) [...]
05/27/25 01:10:26  Done     Pre   Queued    Post   Ready   Un-Ready   Failed   Futile
05/27/25 01:10:26   ===     ===      ===     ===     ===        ===      ===      ===
05/27/25 01:10:26  2051       0        0       0       0          0       35        0
[...]
05/27/25 01:10:26 Node status file not updated because it is not yet outdated
05/27/25 01:10:26 Wrote metrics file /data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag.metrics.
[...]
05/27/25 01:10:26 **** condor_dagman (condor_DAGMAN) pid 2561844 EXITING WITH STATUS 1
Note the "Node status file not updated" message, which to me contradicts the documentation at https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-information-files.html
which states: "The status file is also updated a final time when the DAG completes either successfully or not."
Indeed, in the status file I have [1] and in the metrics file I have [2].
So my questions are:
1. Why was the status file not updated? What can I do to ensure it is?
2. Why is the DAG status 3 in the status file, but 2 in the metrics file?
3. How do I reconcile "DAG status: 2 (DAG_STATUS_NODE_FAILED)" with the list of status values in the Dagman documentation?
As a final note, Dagman wrote a rescue file which looks just fine to me (thanks!), but so far we have been relying on the status file to decide whether a Dagman can be restarted (called resubmission in CRAB). As long as the node status file says STATUS 3 (SUBMITTED), we prevent users from restarting the Dagman.
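To make the check concrete, here is a minimal sketch of that decision. It is not the actual CRAB code. The function names are hypothetical, and treating a metrics file that contains "end_time" as proof that Dagman has exited is only an assumption on my side, not something I found in the documentation.

```python
import json
import re

STATUS_SUBMITTED = 3  # value seen in the node status file excerpt [1]


def status_file_dag_status(text):
    """Return the integer DagStatus found in node status file text, or None."""
    match = re.search(r'^\s*DagStatus\s*=\s*(\d+)\s*;', text, re.MULTILINE)
    return int(match.group(1)) if match else None


def dagman_has_exited(metrics_text):
    """ASSUMPTION: a metrics file containing 'end_time' means Dagman exited."""
    if not metrics_text:
        return False
    try:
        metrics = json.loads(metrics_text)
    except json.JSONDecodeError:
        return False
    return isinstance(metrics, dict) and "end_time" in metrics


def can_resubmit(status_text, metrics_text=None):
    """Hypothetical rule: trust the metrics file first, then the status file."""
    if dagman_has_exited(metrics_text):
        return True
    # Current behaviour: block the restart while the status file says SUBMITTED.
    return status_file_dag_status(status_text) != STATUS_SUBMITTED
```

With the files shown in [1] and [2], the status file alone would keep blocking the restart, while the metrics file would allow it.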
I apologise if this message is not clear; it is a new situation and I do not fully understand what is happening.
Thanks for your help
Stefano
[1]
[
Type = "DagStatus";
DagFiles = {
"/data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag"
};
Timestamp = 1748301015; /* "Tue May 27 01:10:15 2025" */
DagStatus = 3; /* "STATUS_SUBMITTED ()" */
NodesTotal = 2086;
NodesDone = 2051;
NodesPre = 35;
NodesQueued = 0;
NodesPost = 0;
[2]
{
"client":"condor_dagman",
"version":"24.0.6",
"type":"metrics",
"start_time":1748301012.527,
"end_time":1748301026.335,
"duration":13.808,
"exitcode":1,
"dagman_id":"123521780",
"parent_dagman_id":"",
"rescue_dag_number":0,
"jobs":2086,
"jobs_failed":35,
"jobs_succeeded":2051,
"dag_jobs":0,
"dag_jobs_failed":0,
"dag_jobs_succeeded":0,
"total_jobs":2086,
"total_jobs_run":2086,
"DagStatus":2
}