[HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"
- Date: Tue, 12 Feb 2019 09:42:28 +0100
- From: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
- Subject: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"
Hello,
I'm using a Condor farm to run DAGs containing a dozen independent
tasks, each task made of a few processes that run sequentially
following the parent/child logic. Lately I have been encountering errors
like the one below:
(...)
02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
02/08/19 00:30:10 Number of idle job procs: 0
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
(...)
02/08/19 00:30:10 ProcessLogEvents() returned false
02/08/19 00:30:10 Aborting DAG...
(...)
Condor correctly assesses the job as successfully completed, but then
seems to start executing it again immediately. A "BAD EVENT" error
follows and the DAG aborts, killing all the jobs that were still
running.
So far the problem seems to occur randomly: some DAGs complete fine,
and when the problem does occur, the job affected is different each
time, as are the machine and the slot on which that particular job is
running.
In the above example, the DAG snippet is fairly simple:
(...)
JOB test_20190208_narnaud_virgo_status virgo_status.sub
VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
RETRY test_20190208_narnaud_virgo_status 1
(...)
and the sub file reads:
universe = vanilla
executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
priority = 10
getenv = True
error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
notification = never
+Experiment = "DetChar"
+AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
queue 1
=> Would you know what could cause this error? And whether the fix is at
my level (user) or at the level of the farm?
=> And, until the problem is fixed, would there be a way to convince the
DAG to continue instead of aborting? Possibly by modifying the default
value of the macro
DAGMAN_ALLOW_EVENTS = 114
? But changing this value to 5 [!?] is said to "break the semantics of
the DAG", so I'm not sure this is the right way to proceed.
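For what it's worth, here is the kind of override I had in mind, in case it helps frame the question. This is only a sketch, assuming (untested on my side) that HTCondor's usual convention of reading configuration overrides from _CONDOR_-prefixed environment variables also applies to condor_dagman, so that the change would affect a single submission rather than the whole farm; the DAG file name is of course just a placeholder:

```shell
# Hypothetical per-submission override (assumption: condor_dagman honors
# _CONDOR_-prefixed environment variables like other HTCondor daemons).
# 5 is the value said to allow almost all "bad" events, at the cost of
# "breaking the semantics of the DAG".
export _CONDOR_DAGMAN_ALLOW_EVENTS=5

# Submit the DAG as usual; only this run would see the relaxed setting.
condor_submit_dag my_workflow.dag
```

The alternative would be changing DAGMAN_ALLOW_EVENTS in the pool-wide configuration, but that seems heavy-handed for what may be a transient log-reading issue.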
Thanks in advance for your help,
Nicolas