
Re: [HTCondor-users] problems with Dagman status files



Hi Stefano,

  1. When DAGMan goes to write the final node status file on its way out, it first checks whether its state has changed and only writes the file if it has. In this case it appears that DAGMan determined that its state had not changed since the previous write of the node status file. If you want DAGMan to rewrite the node status file at exit regardless of state change, you can add the ALWAYS-UPDATE keyword to the NODE_STATUS_FILE declaration in your DAG file.
  2. Points 2 & 3 are due to the fact that DagStatus means different things in the node status file and the metrics file. This discrepancy is something that has sadly existed for a long time, and I don't know the best way to address it.
    1. For the node status file, the DagStatus is equivalent to the value list you provided from our documentation.
    2. For the metrics file, the DagStatus is set to the value of the DAGMan job's ClassAd attribute DAG_Status.
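
As a rough illustration of point 1, the sketch below appends ALWAYS-UPDATE to a NODE_STATUS_FILE line in the text of a DAG file when the keyword is missing. The helper name ensure_always_update, the sample file name, and the 30 second update interval are hypothetical; please check the keyword placement against the DAGMan documentation for your version before relying on it.

```python
def ensure_always_update(dag_text: str) -> str:
    """Append ALWAYS-UPDATE to any NODE_STATUS_FILE line that lacks it."""
    out = []
    for line in dag_text.splitlines():
        words = line.split()
        # The command name and keyword are matched case-insensitively here.
        if words and words[0].upper() == "NODE_STATUS_FILE":
            if "ALWAYS-UPDATE" not in (w.upper() for w in words[1:]):
                line = line.rstrip() + " ALWAYS-UPDATE"
        out.append(line)
    return "\n".join(out)


# Hypothetical DAG fragment: the file name and interval are made up.
example = "JOB A a.sub\nNODE_STATUS_FILE node_state 30"
print(ensure_always_update(example))
```

Running the helper a second time leaves the text unchanged, so it should be safe to apply to DAG files that already carry the keyword.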

I will update the documentation to describe more accurately when the node status file is written at exit, and to warn that DagStatus means different things in the two files.
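
In the meantime, tooling that reads both files may want to keep the two meanings of DagStatus apart. Here is a minimal sketch, assuming the metrics file is plain JSON as in your example. The function names are hypothetical, and the lookup tables deliberately contain only the two values that appear in this thread; any other value is reported as unknown rather than guessed.

```python
import json

# Only values confirmed in this thread are listed; extend from the documentation.
NODE_STATUS_FILE_NAMES = {3: "STATUS_SUBMITTED"}
METRICS_FILE_NAMES = {2: "DAG_STATUS_NODE_FAILED"}


def describe_dag_status(value: int, source: str) -> str:
    """Name a DagStatus value according to the file it was read from.

    source must be "node_status" or "metrics".
    """
    tables = {"node_status": NODE_STATUS_FILE_NAMES, "metrics": METRICS_FILE_NAMES}
    return tables[source].get(value, f"unknown ({value})")


def metrics_dag_status(metrics_json: str) -> int:
    """Extract DagStatus from the text of a DAGMan metrics file."""
    return int(json.loads(metrics_json)["DagStatus"])


# Trimmed-down sample using fields from the metrics file quoted in this thread.
sample = '{"exitcode": 1, "jobs_failed": 35, "DagStatus": 2}'
print(describe_dag_status(metrics_dag_status(sample), "metrics"))
print(describe_dag_status(3, "node_status"))
```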

Hope this helps,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stefano Belforte via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, May 29, 2025 5:22 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Stefano Belforte <stefano.belforte@xxxxxxx>
Subject: [HTCondor-users] problems with Dagman status files
 

Dear experts,

we have hit a few cases of an odd situation (for me at least) with Dagman status.
Can you kindly help me understand the information there correctly? Or is there by any chance an inconsistency in the way those files are written, or documented?

There are more details in https://github.com/dmwm/CRABServer/issues/9098 and more can be provided upon request, like full log files or access to the AP where things ran.

Summary is:

this is not the norm, usually things work fine (thanks!)

we had a bad day where, for unknown reasons, a lot of POST scripts failed immediately, leading to nodes being retried (as per the dagman description file, all fine there). Due to the odd way in which POST failed, the PRE script of the retries failed too. Once the max retry count for a node was reached, the node was marked as failed, but it was still listed as in PRERUN (!?)

After all affected nodes ended like that, the dagman debug file has (trimmed):

05/27/25 01:10:25 Aborting DAG...
[...]
05/27/25 01:10:26 DAG status: 2 (DAG_STATUS_NODE_FAILED)
[...]
05/27/25 01:10:26  Done     Pre   Queued    Post   Ready   Un-Ready   Failed   Futile
05/27/25 01:10:26   ===     ===      ===     ===     ===        ===      ===      ===
05/27/25 01:10:26  2051       0        0       0       0          0       35        0
[...]
05/27/25 01:10:26 Node status file not updated because it is not yet outdated
05/27/25 01:10:26 Wrote metrics file /data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag.metrics.
[...]
05/27/25 01:10:26 **** condor_dagman (condor_DAGMAN) pid 2561844 EXITING WITH STATUS 1

Note the "Node status file not updated" message, which to me contradicts
the documentation https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-information-files.html
which states "The status file is also updated a final time when the DAG completes either successfully or not."
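
For anyone who wants to spot this case automatically, the following is a minimal sketch that scans the text of a dagman debug file for the two lines of interest above. The function name scan_debug_log and the shape of the returned dictionary are hypothetical, and the patterns are assumed from the excerpt in this message only, so other DAGMan versions may word these lines differently.

```python
import re

# Patterns taken from the debug file excerpt quoted in this message.
STATUS_RE = re.compile(r"DAG status: (\d+) \((\w+)\)")
SKIP_MARKER = "Node status file not updated"


def scan_debug_log(text: str) -> dict:
    """Return the last DAG status reported in the log and whether any
    'Node status file not updated' message was seen."""
    result = {"dag_status": None, "dag_status_name": None, "status_file_skipped": False}
    for line in text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            # Later matches overwrite earlier ones, so the last report wins.
            result["dag_status"] = int(match.group(1))
            result["dag_status_name"] = match.group(2)
        if SKIP_MARKER in line:
            result["status_file_skipped"] = True
    return result


excerpt = (
    "05/27/25 01:10:26 DAG status: 2 (DAG_STATUS_NODE_FAILED)\n"
    "05/27/25 01:10:26 Node status file not updated because it is not yet outdated\n"
)
print(scan_debug_log(excerpt))
```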

Indeed in the status file I have [1] and in metrics file I have [2]

So my questions are:

1. Why was the status file not updated? What can I do to ensure it is?

2. Why is the DAG status 3 in the status file, but 2 in the metrics file?

3. How do I reconcile "DAG status: 2 (DAG_STATUS_NODE_FAILED)" with the list of status values in the Dagman documentation?


As a final note, Dagman wrote a rescue file which looks just fine to me (thanks!), but so far we have been relying on the status file to decide whether a dagman can be restarted (called resubmission in CRAB). As long as the node status file says STATUS 3 (SUBMITTED), we prevent users from restarting the Dagman.

I apologise if this message is not clear; it is a new situation and I do not fully understand what's happening.

Thanks for your help

Stefano


[1]

[
  Type = "DagStatus";
  DagFiles = {
    "/data/srv/glidecondor/condor_local/spool/1780/0/cluster123521780.proc0.subproc0/RunJobs.dag"
  };
  Timestamp = 1748301015; /* "Tue May 27 01:10:15 2025" */
  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
  NodesTotal = 2086;
  NodesDone = 2051;
  NodesPre = 35;
  NodesQueued = 0;
  NodesPost = 0;

[2]

{
    "client":"condor_dagman",
    "version":"24.0.6",
    "type":"metrics",
    "start_time":1748301012.527,
    "end_time":1748301026.335,
    "duration":13.808,
    "exitcode":1,
    "dagman_id":"123521780",
    "parent_dagman_id":"",
    "rescue_dag_number":0,
    "jobs":2086,
    "jobs_failed":35,
    "jobs_succeeded":2051,
    "dag_jobs":0,
    "dag_jobs_failed":0,
    "dag_jobs_succeeded":0,
    "total_jobs":2086,
    "total_jobs_run":2086,
    "DagStatus":2
}