Hi All,
after applying the suggested workaround, everything seems to be working properly.
Thanks again.
Giuseppe
On 2/19/19 10:56 PM, Mark Coatsworth wrote:
Hi everyone,
I got some help from Jaime on this, who was finally able to figure out
what's happening. There's a bug in condor_shadow which sometimes causes
it to report old job IDs in the event log. We've written up a ticket
with more details:
Jaime is working on a fix for this which should be available in our
next stable release (8.8.2).
In the meantime, there's a workaround you can use that is safer than
the one we talked about previously. If you set a value for your
EVENT_LOG configuration knob (which is empty by default), that should
force the shadow to run the code that refreshes the job ID. The value
should be the path to a file where events can get written. More info
here:
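As a minimal sketch, the knob could be set like this in the local
configuration on the submit machine (the file location is only an
example; any path the HTCondor daemons can write to should do):

    # e.g. in the local config file (condor_config.local); path is an assumption
    EVENT_LOG = $(LOG)/EventLog

    # then have the daemons re-read their configuration
    condor_reconfig

Here $(LOG) is the standard configuration macro pointing at HTCondor's
log directory.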
Please give this a try, and we'll keep you posted on getting this
problem fixed.
Mark
Hi Mark,
unfortunately those logs are gone... so Nicolas re-ran his DAGs, and a
BAD EVENT occurred on job id 285383.0. I have sent the logs for that
job, so you should be able to find out about the bug in those logs.
Cheers
Giuseppe
--
===============================================
Giuseppe Di Biase - giuseppe.dibiase@xxxxxxxxx
European Gravitational Observatory - EGO
Via E.Amaldi - 56021 Cascina (Pisa) - IT
Phone: +39 050 752 577
===============================================
-------- Original message --------
Date: 15/02/19 17:03 (GMT+01:00)
Subject: Re: Fwd: Re: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"
Hi Giuseppe, thanks for sending all this information.
Unfortunately the ShadowLog and SchedLog files you sent do not have any
information about Nicolas' job (281392.0.0). This job ran on Feb 7,
whereas the logs you sent only have information from Feb 14, so the
information we need has already gone stale.
It's possible you'll find some information in ShadowLog.old and
SchedLog.old (look for the string "281392"); if so, please send those
to me.
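For example, assuming the logs live in the default /var/log/condor
location (substitute whatever your LOG configuration points at):

    # search the rotated daemon logs on the submit machine for the job id
    grep 281392 /var/log/condor/ShadowLog.old /var/log/condor/SchedLog.old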
If there's no information in there, I'll need you to coordinate.
Nicolas will need to run the jobs again until they produce the same
error, then I'll need Giuseppe to send me the log files right away
while they still contain the information we need.

Sorry for the inconvenience. There's just no way for us to diagnose
this problem without the log output.
Mark
Hi Mark,
I prefer to send you logs instead of creating a Linux account for now;
hopefully the logs will reveal any configuration error. The 4 files you
requested are attached.
Thanks
Giuseppe
>> -------- Forwarded message --------
>> Subject: Re: [HTCondor-users] DAG error: "BAD EVENT: job (...) executing, total end count != 0 (1)"
>> Date: Wed, 13 Feb 2019 20:15:24 +0000
>> From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
>> To: Nicolas Arnaud <narnaud@xxxxxxxxxxxx>
>> Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>
>> Hi Nicolas, thanks for all this information!
>>
>> I looked through your log files, and it turns out this was not the
>> problem I expected. The ULOG_EXECUTE event actually does appear twice
>> in the log -- so it's not an issue with our log-reading code (which
>> was the case with those bugs I mentioned). For some reason it looks
>> like your schedd is actually executing the same job twice.
>>
>> We're going to need a few more things to help debug this. Could you
>> please send me the following:
>> * SchedLog
>> * ShadowLog
>> * Your job classad (you can retrieve this by running "condor_history -l 281392")
>> * The output on your submit server from running "ps auxww | grep condor"
>>
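>> (A minimal sketch of capturing those last two items on the submit
>> host; the output file names are only examples:)
>>
>>     condor_history -l 281392 > job_281392.classad
>>     ps auxww | grep condor > condor_ps.txt
>>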
>> Also, is there any way I can get a user account to log in to your
>> submit server? We discussed this at our team meeting this morning and
>> everybody thinks the problem is related to your environment. So it
>> might be easier for us to debug if we can get access, rather than
>> keep asking you to send us things over email.
>>
>> Mark
>>
>>
>>
>> On Tue, Feb 12, 2019 at 3:08 PM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:
>>
>>
>>    Hi Mark,
>>
>>    > I've been looking into this.
>>
>>    Thanks!
>>
>>    > (...)
>>    > Are you running on Windows or Linux? It seems that all previous
>>    > occurrences of this problem happened on Windows.
>>
>>    I'm running on Linux. Some information:
>>
>>    > condor_version
>>    > $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
>>    > $CondorPlatform: x86_64_RedHat7 $
>>    > echo $UNAME
>>    > Linux-x86_64-CL7
>>
>>    > These bugs were never resolved, although it seems like Kent spent
>>    > some time on them and determined the problem was most likely in
>>    > the log-reading code (so at the user level, not the farm).
>>    > However, it's hard to tell without seeing what events are
>>    > actually showing up in the log.
>>    > I'd like to try and reproduce this locally -- could you send your
>>    > a) .nodes.log file, b) .dagman.out file, c) full .dag file? These
>>    > should help me figure out where the bug is happening.
>>
>>    Please find attached two sets of these three files:
>>
>>    * those tagged "20190207_narnaud_2" correspond to a "BAD EVENT"
>>    case followed by a dag abort (DAGMAN_ALLOW_EVENTS = 114, the
>>    default value)
>>
>>    * those tagged "20190212_narnaud_7" correspond to a "BAD EVENT"
>>    case, mitigated by DAGMAN_ALLOW_EVENTS = 5: the dag goes on until
>>    completion.
>>
>>    As the dag file relies on independent sub files, I am also sending
>>    you the template sub file we're using to generate all the
>>    individual task sub files.
>>
>>    > For a short term workaround, you could try adjusting the value
>>    > of DAGMAN_ALLOW_EVENTS to 5 like you suggested. It's true this
>>    > could affect the semantics, but I think the worst case is that
>>    > DAGMan could get stuck in a logical loop. If you're able to keep
>>    > an eye on its progress and manually abort if necessary, I think
>>    > this should work.
>>
>>    See above: indeed, setting DAGMAN_ALLOW_EVENTS = 5 allows the dag
>>    to go on.
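>>    (As a minimal sketch, one way to scope that setting to a single
>>    DAG is DAGMan's per-DAG configuration file; the file name
>>    dagman.config below is only an example:)
>>
>>        # dagman.config (example name)
>>        DAGMAN_ALLOW_EVENTS = 5
>>
>>        # referenced from the .dag file with a CONFIG line
>>        CONFIG dagman.config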
>>
>>    The point is that, since I noticed this issue, I have always been
>>    running the "same" dag: the only thing that changes is its tag --
>>    basically driving the output directory and used in many filenames.
>>    In about 40% of the cases I get a "BAD EVENT" error, but each time
>>    it affects a different task and so happens at a different point of
>>    the dag processing, as the tasks have very different durations. In
>>    the other ~60% of cases, the dag completes fine without any
>>    "BAD EVENT".
>>
>>    Let me know if you need more information or if anything is unclear.
>>
>>    Cheers,
>>
>>    Nicolas
>>
>>    > Mark
>>    >
>>    > On Tue, Feb 12, 2019 at 2:42 AM Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:
>>    >
>>    >    Hello,
>>    >
>>    >    I'm using a Condor farm to run dags containing a dozen
>>    >    independent tasks, each task being made of a few processes
>>    >    running sequentially following the parent/child logic. Lately
>>    >    I have encountered errors like the one below:
>>    >
>>    >    > (...)
>>    >    > 02/08/19 00:30:10 Event: ULOG_IMAGE_SIZE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>>    >    > 02/08/19 00:30:10 Event: ULOG_JOB_TERMINATED for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:06}
>>    >    > 02/08/19 00:30:10 Number of idle job procs: 0
>>    >    > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job proc (281605.0.0) completed successfully.
>>    >    > 02/08/19 00:30:10 Node test_20190208_narnaud_virgo_status job completed
>>    >    > 02/08/19 00:30:10 Event: ULOG_EXECUTE for HTCondor Node test_20190208_narnaud_virgo_status (281605.0.0) {02/08/19 00:30:07}
>>    >    > 02/08/19 00:30:10 BAD EVENT: job (281605.0.0) executing, total end count != 0 (1)
>>    >    > 02/08/19 00:30:10 ERROR: aborting DAG because of bad event (BAD EVENT: job (281605.0.0) executing, total end count != 0 (1))
>>    >    > (...)
>>    >    > 02/08/19 00:30:10 ProcessLogEvents() returned false
>>    >    > 02/08/19 00:30:10 Aborting DAG...
>>    >    > (...)
>>    >
>>    >    Condor correctly assesses one job as being successfully
>>    >    completed, but it seems that it starts executing it again
>>    >    immediately. Then there is a "BAD EVENT" error and the DAG
>>    >    aborts, killing all the jobs that were running.
>>    >
>>    >    So far this problem seems to occur randomly: some dags
>>    >    complete fine while, when the problem occurs, the job that
>>    >    suffers from it is different each time. So are the machine and
>>    >    the slot on which that particular job is running.
>>    >
>>    >    In the above example, the dag snippet is fairly simple:
>>    >
>>    >    > (...)
>>    >    > JOB test_20190208_narnaud_virgo_status virgo_status.sub
>>    >    > VARS test_20190208_narnaud_virgo_status initialdir="/data/procdata/web/dqr/test_20190208_narnaud/dag"
>>    >    > RETRY test_20190208_narnaud_virgo_status 1
>>    >    > (...)
>>    >
>>    >    and the sub file reads:
>>    >
>>    >    > universe = vanilla
>>    >    > executable = /users/narnaud/Software/RRT/Virgo/VirgoDQR/trunk/scripts/virgo_status.py
>>    >    > arguments = "--event_gps 1233176418.54321 --event_id test_20190208_narnaud --data_stream /virgoData/ffl/raw.ffl --output_dir /data/procdata/web/dqr/test_20190208_narnaud --n_seconds_backward 10 --n_seconds_forward 10"
>>    >    > priority = 10
>>    >    > getenv = True
>>    >    > error = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).err
>>    >    > output = /data/procdata/web/dqr/test_20190208_narnaud/virgo_status/logs/$(cluster)-$(process)-$$(Name).out
>>    >    > notification = never
>>    >    > +Experiment = "DetChar"
>>    >    > +AccountingGroup = "virgo.prod.o3.detchar.transient.dqr"
>>    >    > queue 1
>>    >
>>    >    => Would you know what could cause this error? And whether
>>    >    this is at my level (user) or at the level of the farm?
>>    >
>>    >    => And, until the problem is fixed, would there be a way to
>>    >    convince the dag to continue instead of aborting? Possibly by
>>    >    modifying the default value of the macro
>>    >
>>    >    > DAGMAN_ALLOW_EVENTS = 114
>>    >
>>    >    ? But changing this value to 5 [!?] is said to "break the
>>    >    semantics of the DAG" => I'm not sure this is the right way
>>    >    to proceed.
>>    >
>>    >    Thanks in advance for your help,
>>    >
>>    >    Nicolas
--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison