HTCondor Project List Archives




Re: [Condor-devel] (no subject)



At 02:33 PM 4/8/2005, Michael Yoder wrote:

I just found a nasty condor bug.  This bug has caused a Really Big
Company's job_queue.log file to grow to over 2 Gig in size.  When the
schedd was reconfigured, shut down, or brought up, it would go out to
lunch for about 3 hours while attempting to truncate this monstrous
job_queue.log.

As you know, the schedd registers a timer for QUEUE_CLEAN_INTERVAL
(defaults to 24 hours).  When this timer fires, the schedd operates on
its own job queue file, removing all the information it doesn't need,
like completed clusters.

When the timer is registered, it's registered like this:

cleanid = daemonCore->Register_Timer(QueueCleanInterval,
                (Event)&CleanJobQueue,"CleanJobQueue");

In this form, the timer registered is a ONE-TIME TIMER.  It will fire
once, and then never again.  When the schedd is reconfigured, the timer
is re-registered, and it will fire one more time.

The fix, of course, is

cleanid = daemonCore->Register_Timer(
        QueueCleanInterval,
        QueueCleanInterval,
        (Event)&CleanJobQueue,
        "CleanJobQueue");

While you're there, you may want to review the other timers registered
nearby - some of them use the one-time form.
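
To make the difference between the two registration forms concrete, here is a minimal, self-contained sketch. It does not use DaemonCore: ToyTimerManager, register_timer, advance, and count_cleanups are hypothetical names invented for this illustration, and time is simulated in whole hours. The only behavior borrowed from the message above is that a two-argument registration fires once, while a registration with an added period argument keeps firing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a timer manager. This is NOT HTCondor's
// DaemonCore; it only mimics the one-shot vs. periodic distinction.
class ToyTimerManager {
public:
    using Handler = std::function<void()>;

    // One-shot form: fires once, `delay` ticks from now, then never again.
    int register_timer(unsigned delay, Handler h) {
        return register_timer(delay, 0, std::move(h));
    }

    // Periodic form: first fires after `delay` ticks, then every `period`
    // ticks. In this sketch a period of 0 means "one-shot".
    int register_timer(unsigned delay, unsigned period, Handler h) {
        assert(delay > 0);  // a zero delay would never fire in this simulation
        timers_.push_back({now_ + delay, period, std::move(h), true});
        return static_cast<int>(timers_.size()) - 1;
    }

    // Advance simulated time one tick at a time, firing any timer that is due.
    void advance(unsigned ticks) {
        for (unsigned s = 0; s < ticks; ++s) {
            ++now_;
            for (std::size_t i = 0; i < timers_.size(); ++i) {
                if (!timers_[i].active || timers_[i].when != now_) continue;
                Handler h = timers_[i].handler;  // copy in case the handler adds timers
                h();
                if (timers_[i].period > 0) {
                    timers_[i].when = now_ + timers_[i].period;  // re-arm
                } else {
                    timers_[i].active = false;  // one-shot: retire after firing
                }
            }
        }
    }

private:
    struct Timer {
        unsigned when;
        unsigned period;
        Handler handler;
        bool active;
    };
    std::vector<Timer> timers_;
    unsigned now_ = 0;
};

// Count how many times a "cleanup" handler runs over `duration` ticks,
// using either the one-shot or the periodic registration form.
int count_cleanups(bool periodic, unsigned interval, unsigned duration) {
    ToyTimerManager mgr;
    int fired = 0;
    if (periodic) {
        mgr.register_timer(interval, interval, [&fired] { ++fired; });
    } else {
        mgr.register_timer(interval, [&fired] { ++fired; });
    }
    mgr.advance(duration);
    return fired;
}
```

With a 24-tick interval over 72 ticks, count_cleanups(false, 24, 72) runs the handler once and count_cleanups(true, 24, 72) runs it three times. That mirrors why a queue-cleaning timer registered in the one-shot form would run only once per registration, which is consistent with the unbounded job_queue.log growth described above.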

Hi Mike -

Thanks for pointing out this bug. It is now fixed in the CVS source code (for ver 6.6.10 and above).

The other one-time timers in that function are OK --- they should be one-time.

best regards,
Todd



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba Madison, WI 53706-1685
Phone: (608) 263-7132 FAX: (608) 262-9777