Re: [HTCondor-users] Detailled monitoring of a DAG
- Date: Tue, 31 Aug 2021 08:59:04 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
On 8/31/21 5:33 AM, Nicolas Arnaud wrote:
Dear all,
What (Python) framework or approach would you recommend for monitoring
each DAG instance in detail as it runs? I want to know which DAGs/blocks/jobs
completed successfully or failed, how long each DAG/block/job took,
why a particular job took that long (evictions, etc.), and so on. I would
then use the individual DAG summary data to build long-term
statistics and identify problems in my code or the software environment.
Hi Nicolas:
I don't think there is an existing, comprehensive solution for this
today. The HTCondor Python bindings have tools to read the job logs
(not the DAG logs, but the job logs), and the job logs are annotated
with the DAG node name, so that might be helpful. Some groups add a
PRE or POST script to their DAG nodes to explicitly log additional
information about job starts and restarts.
-greg
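
A minimal sketch of the job-log approach described above, not an official
recipe. It assumes the bindings' htcondor.JobEventLog reader, and it assumes
the DAG node name shows up in the submit event's "LogNotes" attribute as
"DAG Node: <name>"; check both against your HTCondor version. The helper
names read_events and summarize are hypothetical.

```python
def read_events(log_path):
    """Yield (job_id, event_name, timestamp, notes) tuples from a job log.

    Assumes the HTCondor Python bindings are installed. The import is done
    here so that summarize() below can be used and tested without them.
    """
    import htcondor

    # stop_after=0 reads the events already in the log and then stops,
    # instead of waiting for new ones.
    for ev in htcondor.JobEventLog(log_path).events(stop_after=0):
        job_id = f"{ev.cluster}.{ev.proc}"
        # Assumption: the DAG node name is stored in "LogNotes" on submit.
        yield job_id, ev.type.name, ev.timestamp, ev.get("LogNotes", "")


def summarize(events):
    """Build per-node statistics from (job_id, event_name, timestamp, notes)."""
    stats = {}    # node name -> counters and timestamps
    node_of = {}  # job id -> node name, learned from the submit event
    for job_id, name, ts, notes in events:
        if name == "SUBMIT":
            label = notes.replace("DAG Node:", "").strip()
            node_of[job_id] = label or job_id  # fall back to the job id
        node = node_of.get(job_id, job_id)
        rec = stats.setdefault(node, {
            "submitted": None, "finished": None,
            "starts": 0, "evictions": 0, "wall": None,
        })
        if name == "SUBMIT":
            rec["submitted"] = ts
        elif name == "EXECUTE":
            rec["starts"] += 1       # more than one start means a restart
        elif name == "JOB_EVICTED":
            rec["evictions"] += 1
        elif name == "JOB_TERMINATED":
            rec["finished"] = ts
            if rec["submitted"] is not None:
                # Seconds from submission to termination, queue time included.
                rec["wall"] = ts - rec["submitted"]
    return stats
```

Usage would look like summarize(read_events("my_dag_jobs.log")), where the
file name is a placeholder for whatever your submit files set as the job log.
The per-node dictionaries can then be written out after each DAG run and
aggregated for longer-term statistics.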