In the .dagman.out file I see the following -- the job that caused the problem is cluster 838, and I have snipped activity unrelated to 838:
==
01/25 13:25:38 From submit: Submitting job(s).
01/25 13:26:08 From submit: Logging submit event(s).
01/25 13:26:08 From submit: 1 job(s) submitted to cluster 838.
01/25 13:26:08 assigned Condor ID (838.0)
[snip]
01/25 13:26:08 Event: ULOG_EXECUTE for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) executing, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 Number of idle job procs: 2
01/25 13:26:08 Event: ULOG_JOB_TERMINATED for Condor Node MyJob (838.0)
01/25 13:26:08 BAD EVENT: job (838.0.0) ended, submit count < 1 (0)
01/25 13:26:08 BAD EVENT is warning only
01/25 13:26:08 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3119 in file dag.cpp
==
Note that the ULOG_EXECUTE event came in with no preceding
ULOG_SUBMIT. I see another manifestation of this in the monolithic log
file that all of my .condor submit files set as their "log =" destination
(a sketch of the relevant submit-file lines follows the excerpt below):
==
001 (838.000.000) 01/25 13:25:49 Job executing on host: <192.168.131.114:52465>
...
005 (838.000.000) 01/25 13:25:55 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (838.000.000) 01/25 13:25:38 Job submitted from host: <192.168.129.17:53465>
DAG Node: MyJob
...
==
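For reference, every submit description file in the DAG points at that one
shared log, roughly like this (the executable name and paths here are just
placeholders, not my real ones):
==
# excerpt from one of my .condor submit files (names and paths are illustrative)
executable = myjob.sh
output     = myjob.out
error      = myjob.err
# every node in the DAG writes events to this same file
log        = /shared/condor/all_jobs.log
queue
==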
The executing event appears in the log before the submitted event. The
timestamps themselves look fine, but the events were written to the log file
out of order. The job was very small and ran for only about 6 seconds.
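In case it's useful, this is how I've been pulling the cluster-838 events out
of that shared log to check the ordering (the log path is again just a
placeholder):
==
$ grep -n '838.000.000' /shared/condor/all_jobs.log
==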
Where can I look next to get more detail on the root cause? Maybe
it's some kind of networking configuration issue in my data center, and
therefore fixable. Help!