Hi Stefano,
I will update the documentation to better describe that DAGMan attempts to write the node status file on exit, and to warn that DagStatus differs between the two files.
Hope this helps,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stefano Belforte via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, May 29, 2025 5:22 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Stefano Belforte <stefano.belforte@xxxxxxx>
Subject: [HTCondor-users] problems with Dagman status files

Dear experts,

we have hit a few cases of an odd situation (for me at least) with Dagman status. There are more details in https://github.com/dmwm/CRABServer/issues/9098 and more can be provided upon request, like full log files or access to the AP where things ran.

Summary is:

This is not the norm; usually things work fine (thanks!).

We had a bad day where, for unknown reasons, a lot of POST scripts failed immediately, leading to nodes being retried (as per the dagman description file, all fine there).

Due to the odd way in which POST failed, the PRE script of the retries failed too.

Once the max retry for the node was reached, the node was failed, but still listed as in PRERUN (!?)

After all affected nodes ended like that, the dagman debug file has (trimmed):
Note the "Node status file not updated" which to me is in contradiction with

1. Why was the status file not updated? What can I do to ensure it is?
2. Why is the DAG status 3 in the status file, but 2 in the metric file?
3. How do I reconcile "
As a final note, Dagman wrote a rescue file which looks just fine to me (thanks!), but so far we have been relying on the status file to decide whether a dagman could be restarted (called resubmission in CRAB). As long as the node file says STATUS 3 (SUBMITTED) we prevent users from restarting the Dagman.

I apologise if this message is not clear; it is a new situation and I do not fully understand what's happening.

Thanks for your help,
Stefano
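
The restart check described in the final note can be sketched roughly as below. This is only an illustration, not CRAB's actual code: it assumes the node status file contains a line of the form `DagStatus = <integer>;`, and the function names `read_dag_status` and `may_resubmit` are hypothetical. The value 3 (SUBMITTED) is the one quoted in the message.

```python
import re
from typing import Optional

# Assumption: the node status file has a line like "DagStatus = 3;".
_DAG_STATUS_RE = re.compile(r"^\s*DagStatus\s*=\s*(\d+)", re.MULTILINE)

STATUS_SUBMITTED = 3  # value quoted in the message


def read_dag_status(text: str) -> Optional[int]:
    """Return the DagStatus integer found in the text, or None if absent."""
    match = _DAG_STATUS_RE.search(text)
    return int(match.group(1)) if match else None


def may_resubmit(text: str) -> bool:
    """Allow a restart only when a status is present and is not SUBMITTED."""
    status = read_dag_status(text)
    # A missing status is treated conservatively as "do not restart".
    return status is not None and status != STATUS_SUBMITTED


# Hypothetical excerpt, used only to exercise the functions.
sample = '''[
  Type = "DagStatus";
  DagStatus = 3;
]'''
print(may_resubmit(sample))
```

A check like this depends entirely on the status file having been updated when Dagman exited, which is exactly what did not happen in the cases described.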