Hi, folks. I'm having some trouble with a vanilla-universe DAG and I was wondering if someone here might have some insight into what happened.

Yesterday I submitted a 10476-job DAG to our pool. As of 7:07 this morning, 7567 jobs were complete. Then, at 7:08, the dagman.out read the following:

    06/02 07:08:17 LOG LINE CACHE: Begin Flush

followed by 20,000 lines of logs with slightly different date stamp formatting, ending with:

    6/2 07:08:17 ERROR: failure to read job log
        A log event may be corrupt.  DAGMan will skip the event and try to
        continue, but information may have been lost.  If DAGMan exits
        unfinished, but reports no failed jobs, re-submit the rescue file
        to complete the DAG.
    6/2 07:08:17 ------------------------------
    6/2 07:08:17    Condor Recovery Complete
    6/2 07:08:17 ------------------------------
    06/02 07:08:17 LOG LINE CACHE: End Flush
    06/02 07:08:17 Stopping the caching of log lines.
    06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    06/02 07:08:17 Of 10476 nodes total:
    06/02 07:08:17   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    06/02 07:08:17    ===     ===      ===     ===     ===        ===      ===
    06/02 07:08:17   1802       0      383       0       0       8262       29

Note that at this point it's saying that only 1802 jobs are complete, even though 7567 were reported complete before the LOG LINE CACHE flush began. A couple of seconds later I get the following:

    06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in userlog submit event (71479218.0) doesn't match ID reported earlier by submit command (71625850.0)!  Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't cause an abort.
    06/02 07:08:31 Aborting DAG...
    06/02 07:08:31 Writing Rescue DAG to injections.dag.rescue001...
    06/02 07:08:32 Removing submitted jobs...
    06/02 07:08:32 Removing any/all submitted Condor/Stork jobs...
    06/02 07:08:32 Running: condor_rm -const DAGManJobId' '==' '71479114
    06/02 07:08:32 Note: 0 total job deferrals because of -MaxJobs limit (2500)
    06/02 07:08:32 Note: 0 total job deferrals because of -MaxIdle limit (50)
    06/02 07:08:32 Note: 0 total job deferrals because of node category throttles
    06/02 07:08:32 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
    06/02 07:08:32 Note: 0 total POST script deferrals because of -MaxPost limit (0)
    06/02 07:08:32 Warning: ReadMultipleUserLogs destructor called, but still monitoring 12 log(s)!
    06/02 07:08:32 **** condor_scheduniv_exec.71479114.0 (condor_DAGMAN) pid 13492 EXITING WITH STATUS 1

When I tried to resubmit the rescue DAG, it again came up with only 1802 complete jobs.

Anyone have any idea what happened here? I've never seen anything like this before. I don't know what the "LOG LINE CACHE" is, nor why the status of the DAG would appear so messed up after the "flush" was complete. I found mention of a fixed bug that sounds very similar to this in the release notes for 7.1.4 [0], but we're using 7.4.2.

Any insight would be much appreciated.

jamie.

[0] http://www.cs.wisc.edu/condor/manual/v7.1/8_3Development_Release.html :
    Fixed a bug where if log line caching was enabled in condor_dagman and
    condor_dagman failed during the recovery process, the cache would stay
    active. Now the cache is disabled in all cases at the end of recovery.
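
P.S. For reference, here's roughly how I resubmitted the rescue DAG (the file name is the one from the dagman.out excerpt above), along with the knob the error message points at; I'm assuming it's set in the condor_config like other DAGMAN_* settings, and I haven't flipped it since I'd rather understand why the job IDs diverged first:

    # resubmit the rescue DAG written by the aborted run
    condor_submit_dag injections.dag.rescue001

    # in condor_config (or a local config file), only if you're *sure*
    # the mismatched submit event is harmless:
    DAGMAN_ABORT_ON_SCARY_SUBMIT = False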