Re: [Condor-users] problem with failure associated with LOG LINE CACHE
- Date: Thu, 3 Jun 2010 10:57:52 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] problem with failure associated with LOG LINE CACHE
On Wed, 2 Jun 2010, Jameson Rollins wrote:
> Hi, folks. I'm having some trouble with a vanilla-universe DAG and I
> was wondering if someone here might have some insight into what
> happened.
>
> Yesterday I submitted a 10476-job dag to our pool. As of 7:07 this
> morning, 7567 jobs were complete. Then, at 7:08, the dagman.out read
> the following:
> 06/02 07:08:17 LOG LINE CACHE: Begin Flush
>
> followed by 20,000 lines of logs with slightly different date stamp
> formatting, ending with:
>
> 6/2 07:08:17 ERROR: failure to read job log
> A log event may be corrupt. DAGMan will skip the event and try to
> continue, but information may have been lost. If DAGMan exits
> unfinished, but reports no failed jobs, re-submit the rescue file
> to complete the DAG.
> 6/2 07:08:17 ------------------------------
> 6/2 07:08:17    Condor Recovery Complete
> 6/2 07:08:17 ------------------------------
> 06/02 07:08:17 LOG LINE CACHE: End Flush
> 06/02 07:08:17 Stopping the caching of log lines.
> 06/02 07:08:17 ...done with RECOVERY mode <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> 06/02 07:08:17 Of 10476 nodes total:
> 06/02 07:08:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 06/02 07:08:17   ===     ===      ===     ===     ===        ===      ===
> 06/02 07:08:17  1802       0      383       0       0       8262       29
Okay, it seems like there are several issues here. The first is what
caused DAGMan to go into recovery mode. Did you condor_hold/condor_release
the DAGMan job? If not, did the submit machine go down, or did the schedd
on that machine crash?
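If you're not sure, one quick way to check (just a sketch; the file
locations depend on your local configuration) is to look for hold/release
events in the DAGMan job's own user log, and for a schedd startup banner
in the SchedLog around that time:

    # Hold/release events on the DAGMan job land in its user log,
    # normally written next to the DAG file as <dagfile>.dagman.log:
    grep -E 'Job was held|Job was released' your.dag.dagman.log

    # A schedd crash shows up as a fresh startup banner in the SchedLog
    # (the path depends on your LOG setting):
    grep 'STARTING UP' /var/log/condor/SchedLog

If either of those lines up with the 07:08 timestamp, that's your
recovery trigger.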
> Note that at this point it's saying that only 1802 jobs are complete,
> even though 7567 were reported complete before the LOG LINE CACHE flush
> began.
That's probably because the job count got goofed up by the error reading
the event. The biggest question is what caused that error.
> A couple seconds later I get the following:
>
> 06/02 07:08:31 ERROR: node in:76.800:947001628-947001765: job ID in
> userlog submit event (71479218.0) doesn't match ID reported earlier by
> submit command (71625850.0)! Aborting DAG; set
> DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't
> cause an abort.
>
> When I tried to resubmit the rescue dag, it again came up with only 1802
> complete jobs.
Yeah, that's correct for the rescue DAG, given that DAGMan thought only
1802 jobs were complete when the rescue DAG was generated. I don't think
there's a problem with the rescue DAG itself.
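For reference, the knob named in that error message is an ordinary
configuration macro, so if you were ever sure the mismatch was harmless
(I wouldn't be, until we know what corrupted the log), relaxing it is a
one-line config change:

    # In condor_config: let DAGMan continue past a submit event whose
    # job ID doesn't match what condor_submit reported, rather than
    # aborting the DAG. This papers over the symptom; it does not fix
    # whatever corrupted the log in the first place.
    DAGMAN_ABORT_ON_SCARY_SUBMIT = False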
> Anyone have any idea what happened here? I've never seen anything like
> this before. I don't know what the "LOG LINE CACHE" is, nor why the
> status of the dag would appear so messed up after the "flush" was
> complete.
>
> I found mention of a fixed bug that sounds very similar to this in the
> release notes for 7.1.4 [0], but we're using 7.4.2.
A few more questions relating to the event-reading error:
* Is it possible that you have some other Condor jobs that are using some
of the log files used by the jobs in your DAG? That might cause the
problem that aborted the DAG.
* Are any of your job log files on NFS?
* Is the node in:76.800:947001628-947001765 one of the ones that actually
finished but was not considered finished by DAGMan? (You can tell by
checking for a DONE at the end of that node's JOB line in the rescue DAG,
as in the sketch below.)
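Concretely, something like this (in 7.4.x the rescue file is written next
to your DAG file with a .rescue suffix; the file name below is
illustrative):

    # Check whether DAGMan marked that node complete in the rescue DAG:
    grep 'JOB in:76.800:947001628-947001765' your.dag.rescue
    # A completed node's line ends with DONE, e.g.:
    #   JOB in:76.800:947001628-947001765 <its submit file> DONE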
You might also want to try running condor_check_userlogs on the set of all
log files used by jobs in your DAG, especially the log file used by the
in:76.800:947001628-947001765 job.
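condor_check_userlogs just takes the list of log files as arguments and
reports whether the events in them are mutually consistent; something
like this (the glob is only illustrative; pass whatever log files your
node jobs actually write):

    # Sanity-check all of the node jobs' user logs in one pass:
    condor_check_userlogs /path/to/joblogs/*.log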
Kent Wenger
Condor Team