
[HTCondor-users] problems with Dagman status files



Dear experts,

We have hit a few cases of a situation that looks odd (to me at least) with the Dagman status files.
Could you kindly help me understand the information there correctly? Or is there by any chance an inconsistency in the way these files are written or documented?

There are more details in https://github.com/dmwm/CRABServer/issues/9098 and more can be provided upon request, like full log files or access to the AP where things ran.

Summary is:

This is not the norm; usually things work fine (thanks!).

We had a bad day where, for unknown reasons, a lot of POST scripts failed immediately, leading to nodes being retried (as per the Dagman description file, all fine there). Due to the odd way in which POST failed, the PRE script of the retries failed too. Once the maximum number of retries for a node was reached, the node was failed, but it is still listed as in PRERUN (!?)

After all affected nodes ended like that, the Dagman debug file has (trimmed):

05/27/25 01:10:25 Aborting DAG...
[...]
05/27/25 01:10:26 DAG status: 2 (DAG_STATUS_NODE_FAILED)
[...]
05/27/25 01:10:26  Done     Pre   Queued    Post   Ready   Un-Ready   Failed   Futile
05/27/25 01:10:26   ===     ===      ===     ===     ===        ===      ===      ===
05/27/25 01:10:26  2051       0        0       0       0          0       35        0
[...]
05/27/25 01:10:26 Node status file not updated because it is not yet outdated
05/27/25 01:10:26 Wrote metrics file /data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag.metrics.
[...]
05/27/25 01:10:26 **** condor_dagman (condor_DAGMAN) pid 2561844 EXITING WITH STATUS 1

Note the "Node status file not updated" message, which to me contradicts
the documentation at https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-information-files.html
which states: "The status file is also updated a final time when the DAG completes either successfully or not."

Indeed, in the status file I have [1] and in the metrics file I have [2].
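
To make the mismatch concrete: the final table in the debug file reports 35 nodes as Failed and 0 in Pre, while [1] still has NodesPre = 35. A minimal sketch of how that table could be read for such a comparison is below. The function name is hypothetical, and it assumes the column layout is exactly the one in the excerpt above (date and time first, then the header row, a separator row, and the counts).

```python
def parse_node_summary(log_lines):
    """Return {column name: count} from the node summary table, or None.

    Assumes each line starts with a date and a time token, and that the
    counts row sits two lines below the header row (after the === row).
    """
    for i, line in enumerate(log_lines):
        columns = line.split()[2:]          # drop the date and time tokens
        if columns[:2] != ["Done", "Pre"] or i + 2 >= len(log_lines):
            continue
        values = log_lines[i + 2].split()[2:]
        if len(values) == len(columns) and all(v.isdigit() for v in values):
            return dict(zip(columns, (int(v) for v in values)))
    return None


excerpt = [
    "05/27/25 01:10:26  Done     Pre   Queued    Post   Ready   Un-Ready   Failed   Futile",
    "05/27/25 01:10:26   ===     ===      ===     ===     ===        ===      ===      ===",
    "05/27/25 01:10:26  2051       0        0       0       0          0       35        0",
]
summary = parse_node_summary(excerpt)
print(summary["Pre"], summary["Failed"])
```

The counts in the table add up to the NodesTotal value in [1] (2051 + 35 = 2086), so only the state assigned to the 35 nodes differs between the two files.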

So my questions are:

1. Why was the status file not updated? What can I do to ensure it is?

2. Why is the DAG status 3 in the status file, but 2 in the metrics file?

3. How do I reconcile "DAG status: 2 (DAG_STATUS_NODE_FAILED)" with the list of status values in the Dagman documentation?


As a final note, Dagman wrote a rescue file which looks just fine to me (thanks!), but so far we have been relying on the status file to decide whether a Dagman can be restarted (called resubmission in CRAB). As long as the node status file says STATUS 3 (SUBMITTED), we prevent users from restarting the Dagman.
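
For illustration only, a fallback check of the kind we might need could be sketched as follows. All function names are hypothetical, and the sketch rests on an assumption I would like confirmed: that a metrics file carrying an "exitcode" and an "end_time" later than the Timestamp in the status file means Dagman has exited, even if the status file still says 3.

```python
import json
import re


def status_field(status_text, name):
    """Return an integer field such as DagStatus or Timestamp, or None."""
    pattern = r"^\s*" + re.escape(name) + r"\s*=\s*(\d+)\s*;"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def metrics_newer_than_status(status_text, metrics_text):
    """True if the metrics file records an exit after the last status write."""
    try:
        metrics = json.loads(metrics_text)
    except (TypeError, ValueError):
        return False                      # missing or unparsable metrics file
    if not isinstance(metrics, dict) or "exitcode" not in metrics:
        return False
    written = status_field(status_text, "Timestamp")
    ended = metrics.get("end_time")
    if written is None or ended is None:
        return False
    return ended >= written


def can_resubmit(status_text, metrics_text):
    """Hypothetical policy: trust the metrics file over a stale status file."""
    if metrics_newer_than_status(status_text, metrics_text):
        return True
    # Otherwise keep the current rule: no restart while the status is 3
    # (SUBMITTED), or when the status cannot be read at all.
    return status_field(status_text, "DagStatus") not in (None, 3)
```

With the values in [1] and [2] below, the metrics "end_time" (1748301026.335) is later than the status Timestamp (1748301015), so this check would allow the resubmission that the status file alone blocks.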

I apologise if this message is not clear; it is a new situation and I do not fully understand what is happening.

Thanks for your help

Stefano


[1]

[
  Type = "DagStatus";
  DagFiles = {
    "/data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag"
  };
  Timestamp = 1748301015; /* "Tue May 27 01:10:15 2025" */
  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
  NodesTotal = 2086;
  NodesDone = 2051;
  NodesPre = 35;
  NodesQueued = 0;
  NodesPost = 0;

[2]

{
    "client":"condor_dagman",
    "version":"24.0.6",
    "type":"metrics",
    "start_time":1748301012.527,
    "end_time":1748301026.335,
    "duration":13.808,
    "exitcode":1,
    "dagman_id":"123521780",
    "parent_dagman_id":"",
    "rescue_dag_number":0,
    "jobs":2086,
    "jobs_failed":35,
    "jobs_succeeded":2051,
    "dag_jobs":0,
    "dag_jobs_failed":0,
    "dag_jobs_succeeded":0,
    "total_jobs":2086,
    "total_jobs_run":2086,
    "DagStatus":2
}