Hello,
I'm using a Condor farm to run DAGs containing a dozen independent
tasks, each task being made of a few processes that run sequentially
following the parent/child logic (a minimal sketch of such a chain is
included after the sub file below). Lately I have encountered errors
like the one below:
> (...)
> 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> 02/08/19 00:30:10 Number of idle job procs: 0
> 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
> 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
> 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
> 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
> 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
> (...)
> 02/08/19 00:30:10 ProcessLogEvents() returned false
> 02/08/19 00:30:10 Aborting DAG...
> (...)
Condor correctly assesses the job as having completed successfully, but
it seems to start executing it again immediately. DAGMan then reports a
"BAD EVENT" error and the DAG aborts, killing all the jobs that were
still running.
So far the problem seems to occur randomly: some DAGs complete fine,
and when the problem does occur, the job it hits is different each
time, as are the machine and the slot on which that particular job is
running.
In the above example, the DAG snippet for this node is fairly simple
> (...)
> JOB test_20190208_narnaud_virgo_status virgo_status.sub
> VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
> RETRY test_20190208_narnaud_virgo_status 1
> (...)
and the sub file reads
> universe = vanilla
> executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
> arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
> priority = 10
> getenv = True
> error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
> output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
> notification = never
> +Experiment = "DetChar"
> +AccountingGroup= "virgo.prod.o3.detchar.transient.dqr"
> queue 1
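(For reference, the parent/child chaining mentioned at the top of this
message looks roughly like the following; the node names here are
purely illustrative, not taken from my actual DAG:
> JOB  step_a  step_a.sub
> JOB  step_b  step_b.sub
> PARENT step_a CHILD step_b
i.e. each task is just a short linear chain of such nodes.)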
=> Would you know what could cause this error, and whether it should be
addressed at my level (user) or at the level of the farm?
=> And, until the problem is fixed, would there be a way to convince the
DAG to continue instead of aborting? Possibly by modifying the default
value of the macro
> DAGMAN_ALLOW_EVENTS = 114
? However, changing this value to 5 is said to "break the semantics of
the DAG", so I'm not sure this is the right way to proceed.
Thanks in advance for your help,
Nicolas