In the .dagman.out file I see the following -- the job that caused the problem is cluster 838, and I have snipped activity unrelated to 838:
==
01/25 13:25:38 From submit: Submitting job(s).
01/25 13:26:08 From submit: Logging submit event(s).
01/25 13:26:08 From submit: 1 job(s) submitted to cluster 838.
01/25 13:26:08 assigned Condor ID (838.0)
[snip]
01/25 13:26:08 Event: ULOG_EXECUTE for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) executing, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 Number of idle job procs: 2
01/25 13:26:08 Event: ULOG_JOB_TERMINATED for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) ended, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3119 in file dag.cpp
==
Note that the ULOG_EXECUTE event came in with no preceding
ULOG_SUBMIT. I see another manifestation of this in the monolithic log
file that all of my .condor submit files set as their "log =" destination
(a sketch of the relevant submit-file lines follows the excerpt below):
==
001 (838.000.000) 01/25 13:25:49 Job executing on host: <192.168.131.114:52465>
...
005 (838.000.000) 01/25 13:25:55 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (838.000.000) 01/25 13:25:38 Job submitted from host: <192.168.129.17:53465>
DAG Node: MyJob
...
==
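For reference, every submit description file in the DAG points at that one
shared log, roughly like this (the executable name and paths here are just
placeholders, not my real ones):
==
# excerpt from one of my .condor submit files (names and paths are illustrative)
executable = myjob.sh
output     = myjob.out
error      = myjob.err
# every node in the DAG writes events to this same file
log        = /shared/condor/all_jobs.log
queue
==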
The executing event appears in the log before the submitted event. The
timestamps themselves look fine, but the events were written to the log file
out of order. The job was very small and ran for only about 6 seconds.
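In case it's useful, this is how I've been pulling the cluster-838 events out
of that shared log to check the ordering (the log path is again just a
placeholder):
==
$ grep -n '838.000.000' /shared/condor/all_jobs.log
==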
Where can I look next to get more detail on the root cause? Maybe
it's some kind of networking configuration issue in my data center, and
therefore fixable. Help!