[HTCondor-users] Detailed monitoring of a DAG
- Date: Tue, 31 Aug 2021 12:33:23 +0200
- From: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
- Subject: [HTCondor-users] Detailed monitoring of a DAG
Dear all,
I have a DAG containing ~30 parallel "blocks", each consisting of 3-4 jobs
connected by parent-child links. That DAG could be triggered
automatically a dozen times per day or so, and each run would process
different "live" data.
What (Python) framework/approach would you recommend to monitor in
detail the running of each DAG instance? That is: which DAG/blocks/jobs
completed successfully or failed, how long each DAG/block/job took, why
a particular job took that long (evictions, etc.), and so on. I would then
use the individual DAG summary data to build long-term statistics and to
identify problems in my code or the software environment.
All of that information is available by combining the .dag and
.dag.dagman.out files. Are there existing tools that parse these and that
could be used directly or adapted for this goal?
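
To make the .dag side of the question concrete, here is a minimal sketch (standard library only) that reads the JOB and PARENT ... CHILD lines of a DAG description and groups the nodes into connected "blocks". The function names parse_dag and blocks are hypothetical, the sample node and submit-file names are made up, and every other DAG command is simply ignored. Timing and eviction details are not covered and would still have to come from the DAGMan output.

```python
from collections import defaultdict


def parse_dag(text):
    """Return (jobs, edges) from the text of a .dag file.

    jobs maps node name -> submit file; edges is a list of (parent, child).
    Only JOB and PARENT ... CHILD ... lines are handled (an assumption
    made for this sketch); all other lines are skipped.
    """
    jobs, edges = {}, []
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment
        keyword = tokens[0].upper()
        if keyword == "JOB" and len(tokens) >= 3:
            jobs[tokens[1]] = tokens[2]
        elif keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" not in upper:
                continue  # malformed line, skip it
            split = upper.index("CHILD")
            parents, children = tokens[1:split], tokens[split + 1:]
            edges.extend((p, c) for p in parents for c in children)
    return jobs, edges


def blocks(jobs, edges):
    """Group nodes into connected components, i.e. independent blocks."""
    neighbours = defaultdict(set)
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    seen, result = set(), []
    for name in jobs:
        if name in seen:
            continue
        seen.add(name)
        stack, component = [name], []
        while stack:  # depth-first walk over one block
            node = stack.pop()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        result.append(sorted(component))
    return result


# Hypothetical two-block DAG used only for illustration.
sample = """\
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B C
"""
jobs, edges = parse_dag(sample)
print(blocks(jobs, edges))
```

Each block returned this way could then serve as the key under which per-job durations and outcomes are aggregated.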
Thanks in advance for your advice,
Nicolas
--
============================================
= Nicolas ARNAUD =
= =
= Laboratoire de physique des deux infinis =
= Irène Joliot-Curie (IJCLab) =
= CNRS/IN2P3 & Université Paris-Saclay =
= =
= Virgo Experiment =
= =
= European Gravitational Observatory (EGO) =
= Via E. Amaldi, 5 =
= 56021 Santo Stefano a Macerata =
= Cascina (PI) -- Italia =
= Tel: + 39 050 752 314 =
============================================