Hi Mark,
I prefer to send you logs instead of creating a Linux account for now.
I hope the logs will help us spot some configuration error.
Please find attached the 4 files you requested.
Thanks
Giuseppe
>
>
>>
>>
>> -------- Forwarded Message --------
>> Subject: Re: [HTCondor-users] DAG error: "BAD EVENT: job (...)
>> executing, total end count != 0 (1)"
>> Date: Wed, 13 Feb 2019 20:15:24 +0000
>> From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
>> To: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
>> Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>
>> Hi Nicolas, thanks for all this information!
>>
>> I looked through your log files, and it turns out this was not the
>> problem I expected. The ULOG_EXECUTE event actually does appear twice
>> in the log -- so it's not an issue with our log-reading code (which
>> was the case with those bugs I mentioned). For some reason it looks
>> like your schedd is actually executing the same job twice.
>>
>> We're going to need a few more things to help debug this. Could you
>> please send me the following:
>> * ScheddLog
>> * ShadowLog
>> * Your job classad (you can retrieve this by running "condor_history
>> -l 281392")
>> * The output on your submit server from running "ps auxww | grep condor"
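>>
>> For example, these could be collected with commands along these lines
>> (a minimal sketch; SCHEDD_LOG and SHADOW_LOG are resolved via
>> condor_config_val, so the actual paths may differ on your system):
>>
>> > # copy the daemon logs from wherever your configuration puts them
>> > cp "$(condor_config_val SCHEDD_LOG)" schedd.log
>> > cp "$(condor_config_val SHADOW_LOG)" shadow.log
>> > # dump the full classad of the completed job
>> > condor_history -l 281392 > job_281392.classad
>> > # record the HTCondor processes running on the submit server
>> > ps auxww | grep condor > condor_ps.txt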
>>
>> Also, is there any way I can get a user account to log in to your
>> submit server? We discussed this at our team meeting this morning and
>> everybody thinks the problem is related to your environment. So it
>> might be easier for us to debug if we can get access, rather than
>> keep asking you to send us things over email.
>>
>> Mark
>>
>>
>>
>> On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:
>>
>>
>> Hi Mark,
>> > I've been looking into this.
>>
>> Thanks!
>>
>> > (...)
>> > Are you running on Windows or Linux? It seems that all previous
>> > occurrences of this problem happened on Windows.
>>
>> I'm running on Linux. Some information:
>>
>> > condor_version
>> > $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
>> > $CondorPlatform: x86_64_RedHat7 $
>> > echo $UNAME
>> > Linux-x86_64-CL7
>>
>> > These bugs were never resolved, although it seems like Kent spent
>> > some time on them and determined the problem was most likely in the
>> > log-reading code (so at the user level, not the farm). However it's
>> > hard to tell without seeing what events are actually showing up in
>> > the log. I'd like to try and reproduce this locally -- could you
>> > send your a) .nodes.log file, b) .dagman.out file, c) full .dag
>> > file? These should help me figure out where the bug is happening.
>>
>> Please find attached two sets of these three files:
>>
>> * those tagged "20190207_narnaud_2" correspond to a "BAD EVENT" case
>> followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the default value)
>>
>> * those tagged "20190212_narnaud_7" correspond to a "BAD EVENT" case,
>> mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on until completion.
>>
>> As the dag file relies on independent sub files, I am also sending you
>> the template sub file we use to generate all the individual task sub
>> files.
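>>
>> (A hypothetical sketch of how such a template can be instantiated --
>> the placeholder and task names below are illustrative, not our actual
>> generation script:)
>>
>> > # one sub file per task, obtained by filling in the task name
>> > for task in task_a task_b; do
>> >     sed "s/@TASK@/${task}/g" template.sub > "${task}.sub"
>> > done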
>>
>> > For a short term workaround, you could try adjusting the value of
>> > DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this could
>> > affect the semantics, but I think the worst case is that DAGMan
>> > could get stuck in a logical loop. If you're able to keep an eye on
>> > its progress and manually abort if necessary, I think this should
>> > work.
>>
>> See above: indeed setting DAGMAN_ALLOW_EVENTS = 5 allows the dag to
>> go on.
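>>
>> (For reference, a minimal sketch of how this is set -- assuming it is
>> put in the HTCondor configuration on the submit machine; the same knob
>> can also be overridden per user through the
>> _CONDOR_DAGMAN_ALLOW_EVENTS environment variable:)
>>
>> > # in a configuration file read by the submit machine
>> > # (114 is the default; 5 relaxes the "BAD EVENT" checks as discussed)
>> > DAGMAN_ALLOW_EVENTS = 5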
>>
>> The point is that since I noticed this issue I have always been
>> running the "same" dag: the only thing that changes is its tag, which
>> basically determines the output directory and is used in many
>> filenames. In about 40% of the cases I get a "BAD EVENT" error, but
>> each time it affects a different task and so happens at a different
>> point of the dag processing, as the tasks have very different
>> durations. In the remaining ~60% of the cases, the dag completes fine
>> without any "BAD EVENT".
>>
>> Let me know if you need more information or if anything is unclear.
>>
>> Cheers,
>>
>> Nicolas
>>
>> > Mark
>> >
>> >
>> > On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud
>> > <narnaud@xxxxxxxxxxxx> wrote:
>> >
>> >
>> > Hello,
>> >
>> > I'm using a Condor farm to run dags containing a dozen independent
>> > tasks, each task being made of a few processes running sequentially
>> > following the parent/child logic. Lately I have encountered errors
>> > like the one below:
>> >
>> > > (...)
>> > > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>> > > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>> > > 02/08/19 00:30:10 Number of idle job procs: 0
>> > > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
>> > > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
>> > > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
>> > > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
>> > > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
>> > > (...)
>> > > 02/08/19 00:30:10 ProcessLogEvents() returned false
>> > > 02/08/19 00:30:10 Aborting DAG...
>> > > (...)
>> >
>> > Condor correctly assesses one job as successfully completed, but it
>> > seems to start executing it again immediately. Then there is a
>> > "BAD EVENT" error and the DAG aborts, killing all the jobs that were
>> > running.
>> >
>> > So far this problem seems to occur randomly: some dags complete
>> > fine, while, when the problem occurs, the job that suffers from it
>> > is different each time, as are the machine and the slot on which
>> > that particular job is running.
>> >
>> > In the above example, the dag snippet is fairly simple
>> >
>> > > (...)
>> > > JOB test_20190208_narnaud_virgo_status virgo_status.sub
>> > > VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
>> > > RETRY test_20190208_narnaud_virgo_status 1
>> > > (...)
>> >
>> > and the sub file reads
>> >
>> > > universe = vanilla
>> > > executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
>> > > arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
>> > > priority = 10
>> > > getenv = True
>> > > error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
>> > > output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
>> > > notification = never
>> > > +Experiment = "DetChar"
>> > > +AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
>> > > queue 1
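>> >
>> > (For completeness, the sequential chaining inside each task is
>> > expressed with PARENT/CHILD statements of this form -- the node
>> > names below are illustrative, not taken from the actual dag:)
>> >
>> > > JOB taskA_step1 taskA_step1.sub
>> > > JOB taskA_step2 taskA_step2.sub
>> > > # step2 only starts once step1 has completed successfully
>> > > PARENT taskA_step1 CHILD taskA_step2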
>> >
>> > => Would you know what could cause this error? And whether this is
>> > at my level (user) or at the level of the farm?
>> >
>> > => And, until the problem is fixed, would there be a way to convince
>> > the dag to continue instead of aborting? Possibly by modifying the
>> > default value of the macro
>> >
>> > > DAGMAN_ALLOW_EVENTS = 114
>> >
>> > ? But changing this value to 5 [!?] is said to "break the semantics
>> > of the DAG" => I'm not sure this is the right way to proceed.
>> >
>> > Thanks in advance for your help,
>> >
>> > Nicolas
>> >
>> > _______________________________________________
>> > HTCondor-users mailing list
>> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> > with a subject: Unsubscribe
>> > You can also unsubscribe by visiting
>> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> >
>> > The archives can be found at:
>> > https://lists.cs.wisc.edu/archive/htcondor-users/
>> >
>> >
>> >
>> > --
>> > Mark Coatsworth
>> > Systems Programmer
>> > Center for High Throughput Computing
>> > Department of Computer Sciences
>> > University of Wisconsin-Madison
>> > +1 608 206 4703
>> >
>> > _______________________________________________
>> > HTCondor-users mailing list
>> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> > with a subject: Unsubscribe
>> > You can also unsubscribe by visiting
>> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> >
>> > The archives can be found at:
>> > https://lists.cs.wisc.edu/archive/htcondor-users/
>> >
>>
>>
>>