Hello,
I'm using a Condor farm to run DAGs containing a dozen independent
tasks, each task being made of a few processes that run sequentially
following the parent/child logic (a minimal sketch of such a chain is
included after the sub file below). Lately I have encountered errors
like the one below:
> (...)
> 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
> 02/08/19 00:30:10 Number of idle job procs: 0
> 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
> 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
> 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
> 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
> 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
> (...)
> 02/08/19 00:30:10 ProcessLogEvents() returned false
> 02/08/19 00:30:10 Aborting DAG...
> (...)
Condor correctly assesses the job as having completed successfully, but
it seems to start executing it again immediately. DAGMan then reports a
"BAD EVENT" error and the DAG aborts, killing all the jobs that were
still running.
So far the problem seems to occur randomly: some DAGs complete fine,
and when the problem does occur, the job it hits is different each
time, as are the machine and the slot on which that particular job is
running.
In the above example, the DAG snippet for this node is fairly simple
> (...)
> JOB test_20190208_narnaud_virgo_status virgo_status.sub
> VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
> RETRY test_20190208_narnaud_virgo_status 1
> (...)
and the sub file reads
> universe = vanilla
> executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
> arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
> priority = 10
> getenv = True
> error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
> output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
> notification = never
> +Experiment = "DetChar"
> +AccountingGroup= "virgo.prod.o3.detchar.transient.dqr"
> queue 1
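(For reference, the parent/child chaining mentioned at the top of this
message looks roughly like the following; the node names here are
purely illustrative, not taken from my actual DAG:
> JOB  step_a  step_a.sub
> JOB  step_b  step_b.sub
> PARENT step_a CHILD step_b
i.e. each task is just a short linear chain of such nodes.)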
=> Would you know what could cause this error, and whether it should be
addressed at my level (user) or at the level of the farm?
=> And, until the problem is fixed, would there be a way to convince the
DAG to continue instead of aborting? Possibly by modifying the default
value of the macro
> DAGMAN_ALLOW_EVENTS = 114
? However, changing this value to 5 is said to "break the semantics of
the DAG", so I'm not sure this is the right way to proceed.
Thanks in advance for your help,
Nicolas