Hi Mark,
I prefer to send you logs instead of creating a Linux account for now.
I hope the logs will help us spot some configuration error.
Please find attached the 4 files you requested.
Thanks
Giuseppe
>
>
>>
>>
>> -------- Forwarded Message --------
>> Subject: Re: [HTCondor-users] DAG error: "BAD EVENT: job (...)
>> executing, total end count != 0 (1)"
>> Date: Wed, 13 Feb 2019 20:15:24 +0000
>> From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
>> To: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
>> Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>
>> Hi Nicolas, thanks for all this information!
>>
>> I looked through your log files, and it turns out this was not the
>> problem I expected. The ULOG_EXECUTE event actually does appear twice
>> in the log -- so it's not an issue with our log-reading code (which
>> was the case with those bugs I mentioned). For some reason it looks
>> like your schedd is actually executing the same job twice.
>>
>> We're going to need a few more things to help debug this. Could you
>> please send me the following:
>> * ScheddLog
>> * ShadowLog
>> * Your job classad (you can retrieve this by running "condor_history
>> -l 281392")
>> * The output on your submit server from running "ps auxww | grep condor"
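>>
>> For example, these could be collected with commands along these lines
>> (a minimal sketch; SCHEDD_LOG and SHADOW_LOG are resolved via
>> condor_config_val, so the actual paths may differ on your system):
>>
>> > # copy the daemon logs from wherever your configuration puts them
>> > cp "$(condor_config_val SCHEDD_LOG)" schedd.log
>> > cp "$(condor_config_val SHADOW_LOG)" shadow.log
>> > # dump the full classad of the completed job
>> > condor_history -l 281392 > job_281392.classad
>> > # record the HTCondor processes running on the submit server
>> > ps auxww | grep condor > condor_ps.txt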
>>
>> Also, is there any way I can get a user account to log in to your
>> submit server? We discussed this at our team meeting this morning and
>> everybody thinks the problem is related to your environment. So it
>> might be easier for us to debug if we can get access, rather than
>> keep asking you to send us things over email.
>>
>> Mark
>>
>>
>>
>> On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:
>>
>>
>> Hi Mark,
>> > I've been looking into this.
>>
>> Thanks!
>>
>> > (...)
>> > Are you running on Windows or Linux? It seems that all previous
>> > occurrences of this problem happened on Windows.
>>
>> I'm running on Linux. Some information:
>>
>> > condor_version
>> > $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
>> > $CondorPlatform: x86_64_RedHat7 $
>> > echo $UNAME
>> > Linux-x86_64-CL7
>>
>> > These bugs were never resolved, although it seems like Kent spent
>> > some time on them and determined the problem was most likely in the
>> > log-reading code (so at the user level, not the farm). However it's
>> > hard to tell without seeing what events are actually showing up in
>> > the log. I'd like to try and reproduce this locally -- could you
>> > send your a) .nodes.log file, b) .dagman.out file, c) full .dag
>> > file? These should help me figure out where the bug is happening.
>>
>> Please find attached two sets of these three files:
>>
>> * those tagged "20190207_narnaud_2" correspond to a "BAD EVENT" case
>> followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the default value)
>>
>> * those tagged "20190212_narnaud_7" correspond to a "BAD EVENT" case,
>> mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on until completion.
>>
>> As the dag file relies on independent sub files, I am also sending you
>> the template sub file we use to generate all the individual task sub
>> files.
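>>
>> (A hypothetical sketch of how such a template can be instantiated --
>> the placeholder and task names below are illustrative, not our actual
>> generation script:)
>>
>> > # one sub file per task, obtained by filling in the task name
>> > for task in task_a task_b; do
>> >     sed "s/@TASK@/${task}/g" template.sub > "${task}.sub"
>> > done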
>>
>> > For a short term workaround, you could try adjusting the value of
>> > DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this could
>> > affect the semantics, but I think the worst case is that DAGMan
>> > could get stuck in a logical loop. If you're able to keep an eye on
>> > its progress and manually abort if necessary, I think this should
>> > work.
>>
>> See above: indeed setting DAGMAN_ALLOW_EVENTS = 5 allows the dag to
>> go on.
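>>
>> (For reference, a minimal sketch of how this is set -- assuming it is
>> put in the HTCondor configuration on the submit machine; the same knob
>> can also be overridden per user through the
>> _CONDOR_DAGMAN_ALLOW_EVENTS environment variable:)
>>
>> > # in a configuration file read by the submit machine
>> > # (114 is the default; 5 relaxes the "BAD EVENT" checks as discussed)
>> > DAGMAN_ALLOW_EVENTS = 5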
>>
>> The point is that since I noticed this issue I have always been
>> running the "same" dag: the only thing that changes is its tag, which
>> basically determines the output directory and is used in many
>> filenames. In about 40% of the cases I get a "BAD EVENT" error, but
>> each time it affects a different task and so happens at a different
>> point of the dag processing, as the tasks have very different
>> durations. In the remaining ~60% of the cases, the dag completes fine
>> without any "BAD EVENT".
>>
>> Let me know if you need more information or if anything is unclear.
>>
>> Cheers,
>>
>> Nicolas
>>
>> > Mark
>> >
>> >
>> > On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud
>> > <narnaud@xxxxxxxxxxxx> wrote:
>> >
>> >
>> > Hello,
>> >
>> > I'm using a Condor farm to run dags containing a dozen independent
>> > tasks, each task being made of a few processes running sequentially
>> > following the parent/child logic. Lately I have encountered errors
>> > like the one below:
>> >
>> > > (...)
>> > > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>> > > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>> > > 02/08/19 00:30:10 Number of idle job procs: 0
>> > > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
>> > > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
>> > > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
>> > > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
>> > > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
>> > > (...)
>> > > 02/08/19 00:30:10 ProcessLogEvents() returned false
>> > > 02/08/19 00:30:10 Aborting DAG...
>> > > (...)
>> >
>> > Condor correctly assesses one job as successfully completed, but it
>> > seems to start executing it again immediately. Then there is a
>> > "BAD EVENT" error and the DAG aborts, killing all the jobs that were
>> > running.
>> >
>> > So far this problem seems to occur randomly: some dags complete
>> > fine, while, when the problem occurs, the job that suffers from it
>> > is different each time, as are the machine and the slot on which
>> > that particular job is running.
>> >
>> > In the above example, the dag snippet is fairly simple
>> >
>> > > (...)
>> > > JOB test_20190208_narnaud_virgo_status virgo_status.sub
>> > > VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
>> > > RETRY test_20190208_narnaud_virgo_status 1
>> > > (...)
>> >
>> > and the sub file reads
>> >
>> > > universe = vanilla
>> > > executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
>> > > arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
>> > > priority = 10
>> > > getenv = True
>> > > error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
>> > > output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
>> > > notification = never
>> > > +Experiment = "DetChar"
>> > > +AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
>> > > queue 1
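>> >
>> > (For completeness, the sequential chaining inside each task is
>> > expressed with PARENT/CHILD statements of this form -- the node
>> > names below are illustrative, not taken from the actual dag:)
>> >
>> > > JOB taskA_step1 taskA_step1.sub
>> > > JOB taskA_step2 taskA_step2.sub
>> > > # step2 only starts once step1 has completed successfully
>> > > PARENT taskA_step1 CHILD taskA_step2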
>> >
>> > => Would you know what could cause this error? And whether this is
>> > at my level (user) or at the level of the farm?
>> >
>> > => And, until the problem is fixed, would there be a way to convince
>> > the dag to continue instead of aborting? Possibly by modifying the
>> > default value of the macro
>> >
>> > > DAGMAN_ALLOW_EVENTS = 114
>> >
>> > ? But changing this value to 5 [!?] is said to "break the semantics
>> > of the DAG" => I'm not sure this is the right way to proceed.
>> >
>> > Thanks in advance for your help,
>> >
>> > Nicolas
>> >
>> > _______________________________________________
>> > HTCondor-users mailing list
>> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> > with a subject: Unsubscribe
>> > You can also unsubscribe by visiting
>> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> >
>> > The archives can be found at:
>> > https://lists.cs.wisc.edu/archive/htcondor-users/
>> >
>> >
>> >
>> > --
>> > Mark Coatsworth
>> > Systems Programmer
>> > Center for High Throughput Computing
>> > Department of Computer Sciences
>> > University of Wisconsin-Madison
>> > +1 608 206 4703
>> >
>> > _______________________________________________
>> > HTCondor-users mailing list
>> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> > with a subject: Unsubscribe
>> > You can also unsubscribe by visiting
>> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> >
>> > The archives can be found at:
>> > https://lists.cs.wisc.edu/archive/htcondor-users/
>> >
>>
>>
>>