Hi, folks. I'm having some trouble with a vanilla-universe DAG and I
was wondering if someone here might have some insight into what
happened.
Yesterday I submitted a 10476-job DAG to our pool. As of 7:07 this
morning, 7567 jobs were complete. Then, at 7:08, dagman.out read the
following:
06/02 07:08:17 LOG LINE CACHE: Begin Flush
followed by 20,000 log lines with slightly different date-stamp
formatting, ending with:
6/2 07:08:17 ERROR: failure to read job log
A log event may be corrupt. DAGMan will skip the event and try to
continue, but information may have been lost. If DAGMan exits
unfinished, but reports no failed jobs, re-submit the rescue file
to complete the DAG.
6/2 07:08:17 ------------------------------
6/2 07:08:17 Condor Recovery Complete
6/2 07:08:17 ------------------------------
06/02 07:08:17 LOG LINE CACHE: End Flush
06/02 07:08:17 Stopping the caching of log lines.
06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
06/02 07:08:17 Of 10476 nodes total:
06/02 07:08:17    Done     Pre  Queued    Post   Ready Un-Ready  Failed
06/02 07:08:17     ===     ===     ===     ===     ===     ===     ===
06/02 07:08:17    1802       0     383       0       0    8262      29
Note that at this point it's saying that only 1802 jobs are complete,
even though 7567 were reported complete before the LOG LINE CACHE flush
began.
A couple of seconds later I got the following:
06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in userlog submit event (71479218.0) doesn't match ID reported earlier by submit command (71625850.0)! Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't cause an abort.
06/02 07:08:31 Aborting DAG...
06/02 07:08:31 Writing Rescue DAG to injections.dag.rescue001...
06/02 07:08:32 Removing submitted jobs...
06/02 07:08:32 Removing any/all submitted Condor/Stork jobs...
06/02 07:08:32 Running: condor_rm -const DAGManJobId' '==' '71479114
06/02 07:08:32 Note: 0 total job deferrals because of -MaxJobs limit (2500)
06/02 07:08:32 Note: 0 total job deferrals because of -MaxIdle limit (50)
06/02 07:08:32 Note: 0 total job deferrals because of node category throttles
06/02 07:08:32 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
06/02 07:08:32 Note: 0 total POST script deferrals because of -MaxPost limit (0)
06/02 07:08:32 Warning: ReadMultipleUserLogs destructor called, but still monitoring 12 log(s)!
06/02 07:08:32 **** condor_scheduniv_exec.71479114.0 (condor_DAGMAN) pid 13492 EXITING WITH STATUS 1
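(I realize that, per the error message, the abort itself could be
suppressed by setting the knob it names in our local config, something
along the lines of:

    # only if we were sure the job-ID mismatch is actually harmless
    DAGMAN_ABORT_ON_SCARY_SUBMIT = False

but I'd rather understand why the two job IDs disagree before telling
DAGMan to ignore it.)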
When I tried to resubmit the rescue DAG, it again came up with only 1802
complete jobs.
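(By "resubmit" I just mean running the rescue file through
condor_submit_dag directly, roughly:

    condor_submit_dag injections.dag.rescue001

i.e., the file DAGMan reported writing above.)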
Anyone have any idea what happened here? I've never seen anything like
this before. I don't know what the "LOG LINE CACHE" is, nor why the
status of the DAG would appear so messed up after the "flush"
completed.
I found mention of a fixed bug that sounds very similar to this in the
release notes for 7.1.4 [0], but we're using 7.4.2.
Any insight would be much appreciated.
jamie.
[0] http://www.cs.wisc.edu/condor/manual/v7.1/8_3Development_Release.html :
Fixed a bug where if log line caching was enabled in condor_dagman and
condor_dagman failed during the recovery process, the cache would stay
active. Now the cache is disabled in all cases at the end of recovery.