Re: [Condor-users] problem with failure associated with LOG LINE CACHE
- Date: Thu, 3 Jun 2010 10:57:52 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] problem with failure associated with LOG LINE CACHE
On Wed, 2 Jun 2010, Jameson Rollins wrote:
> Hi, folks. I'm having some trouble with a vanilla-universe DAG and I
> was wondering if someone here might have some insight into what
> happened.
>
> Yesterday I submitted a 10476-job dag to our pool. As of 7:07 this
> morning, 7567 jobs were complete. Then, at 7:08, the dagman.out read
> the following:
> 06/02 07:08:17 LOG LINE CACHE: Begin Flush
>
> followed by 20,000 lines of logs with slightly different date stamp
> formatting, ending with:
>
> 6/2 07:08:17 ERROR: failure to read job log
> A log event may be corrupt. DAGMan will skip the event and try to
> continue, but information may have been lost. If DAGMan exits
> unfinished, but reports no failed jobs, re-submit the rescue file
> to complete the DAG.
> 6/2 07:08:17 ------------------------------
> 6/2 07:08:17    Condor Recovery Complete
> 6/2 07:08:17 ------------------------------
> 06/02 07:08:17 LOG LINE CACHE: End Flush
> 06/02 07:08:17 Stopping the caching of log lines.
> 06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> 06/02 07:08:17 Of 10476 nodes total:
> 06/02 07:08:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 06/02 07:08:17   ===     ===      ===     ===     ===        ===      ===
> 06/02 07:08:17  1802       0      383       0       0       8262       29
Okay, it seems like there are several issues here. The first is what
caused DAGMan to go into recovery mode. Did you condor_hold/condor_release
the DAGMan job? If not, did the submit machine go down, or did the schedd
on that machine crash?
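If you're not sure, one quick way to check (just a sketch; the file
locations depend on your local configuration) is to look for hold/release
events in the DAGMan job's own user log, and for a schedd startup banner
in the SchedLog around that time:

    # Hold/release events on the DAGMan job land in its user log,
    # normally written next to the DAG file as <dagfile>.dagman.log:
    grep -E 'Job was held|Job was released' your.dag.dagman.log

    # A schedd crash shows up as a fresh startup banner in the SchedLog
    # (the path depends on your LOG setting):
    grep 'STARTING UP' /var/log/condor/SchedLog

If either of those lines up with the 07:08 timestamp, that's your
recovery trigger.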
> Note that at this point it's saying that only 1802 jobs are complete,
> even though 7567 were reported complete before the LOG LINE CACHE flush
> began.
That's probably because the job count got goofed up by the error reading
the event. The biggest question is what caused that error.
> A couple seconds later I get the following:
>
> 06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in
> userlog submit event (71479218.0) doesn't match ID reported earlier by
> submit command (71625850.0)! Aborting DAG; set
> DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't
> cause an abort.
>
> When I tried to resubmit the rescue dag, it again came up with only 1802
> complete jobs.
Yeah, that's correct for the rescue DAG, given that DAGMan thought only
1802 jobs were complete when the rescue DAG was generated. I don't think
there's a problem with the rescue DAG itself.
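For reference, the knob named in that error message is an ordinary
configuration macro, so if you were ever sure the mismatch was harmless
(I wouldn't be, until we know what corrupted the log), relaxing it is a
one-line config change:

    # In condor_config: let DAGMan continue past a submit event whose
    # job ID doesn't match what condor_submit reported, rather than
    # aborting the DAG. This papers over the symptom; it does not fix
    # whatever corrupted the log in the first place.
    DAGMAN_ABORT_ON_SCARY_SUBMIT = False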
> Anyone have any idea what happened here? I've never seen anything like
> this before. I don't know what the "LOG LINE CACHE" is, nor why the
> status of the dag would appear so messed up after the "flush" was
> complete.
>
> I found mention of a fixed bug that sounds very similar to this in the
> release notes for 7.1.4 [0], but we're using 7.4.2.
A few more questions relating to the event-reading error:
* Is it possible that you have some other Condor jobs that are using some
of the log files used by the jobs in your DAG? That might cause the
problem that aborted the DAG.
* Are any of your job log files on NFS?
* Is the node in:76.800:947001628-947001765 one of the ones that actually
finished but was not considered finished by DAGMan? (You can tell by
checking for a DONE at the end of that node's JOB line in the rescue DAG,
as in the sketch below.)
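Concretely, something like this (in 7.4.x the rescue file is written next
to your DAG file with a .rescue suffix; the file name below is
illustrative):

    # Check whether DAGMan marked that node complete in the rescue DAG:
    grep 'JOB in:76.800:947001628-947001765' your.dag.rescue
    # A completed node's line ends with DONE, e.g.:
    #   JOB in:76.800:947001628-947001765 <its submit file> DONE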
You might also want to try running condor_check_userlogs on the set of all
log files used by jobs in your DAG, especially the log file used by the
in:76.800:947001628-947001765 job.
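condor_check_userlogs just takes the list of log files as arguments and
reports whether the events in them are mutually consistent; something
like this (the glob is only illustrative; pass whatever log files your
node jobs actually write):

    # Sanity-check all of the node jobs' user logs in one pass:
    condor_check_userlogs /path/to/joblogs/*.log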
Kent Wenger
Condor Team