HTCondor Project List Archives



[Condor-devel] (no subject)



I just found a nasty Condor bug.  This bug has caused a Really Big
Company's job_queue.log file to grow to over 2 GB.  Whenever the
schedd was reconfigured, shut down, or started, it would go out to
lunch for about 3 hours while attempting to truncate this monstrous
job_queue.log.

As you know, the schedd registers a timer for QUEUE_CLEAN_INTERVAL
(defaults to 24 hours).  When this timer fires, the schedd operates on
its own job queue file, removing all the information it doesn't need,
like completed clusters.

When the timer is registered, it's registered like this:

cleanid = daemonCore->Register_Timer(
	QueueCleanInterval,
	(Event)&CleanJobQueue,
	"CleanJobQueue");

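In this form, the timer registered is a ONE-TIME TIMER.  It will fire
once, and then never again.  When the schedd is reconfigured, the timer
is re-registered, and it will fire one more time.  So a schedd that
runs for a long stretch without a reconfig cleans its queue exactly
once, and job_queue.log just keeps growing after that.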

The fix, of course, is to pass the interval a second time as the
period, so that the timer repeats:

cleanid = daemonCore->Register_Timer(
	QueueCleanInterval,
	QueueCleanInterval,
	(Event)&CleanJobQueue,
	"CleanJobQueue");

While you're there, you may want to review the other timers registered
nearby - some of them use the one-time form.
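
If it helps to see the difference between the two forms, here is a
rough toy model.  It is NOT the real DaemonCore code: ToyTimerManager,
Advance, and CountCleanups are made-up names for illustration, the
clock is simulated, and each tick stands for one hour.  The only thing
it is meant to show is that the three-argument form fires once while
the form with a period keeps firing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a timer manager (hypothetical; not DaemonCore itself).
class ToyTimerManager {
public:
    using Handler = std::function<void()>;

    // One-shot form: fire once, `delay` ticks from now.
    int Register_Timer(unsigned delay, Handler h, std::string name) {
        return Register_Timer(delay, 0, std::move(h), std::move(name));
    }

    // Periodic form: fire after `delay` ticks, then every `period` ticks.
    // In this toy model a period of 0 means "do not repeat".
    int Register_Timer(unsigned delay, unsigned period, Handler h,
                       std::string name) {
        assert(delay > 0 && "toy model only fires on later ticks");
        timers_.push_back(
            Timer{now_ + delay, period, std::move(h), std::move(name), true});
        return static_cast<int>(timers_.size()) - 1;
    }

    // Advance the simulated clock tick by tick, firing timers that are due.
    void Advance(unsigned ticks) {
        for (unsigned i = 0; i < ticks; ++i) {
            ++now_;
            for (std::size_t j = 0; j < timers_.size(); ++j) {
                if (!timers_[j].active || timers_[j].due != now_) continue;
                if (timers_[j].period > 0) {
                    timers_[j].due = now_ + timers_[j].period;  // re-arm
                } else {
                    timers_[j].active = false;  // one-shot: never again
                }
                Handler h = timers_[j].handler;  // copy before calling
                h();
            }
        }
    }

private:
    struct Timer {
        unsigned due;
        unsigned period;
        Handler handler;
        std::string name;
        bool active;
    };
    std::vector<Timer> timers_;
    unsigned now_ = 0;
};

// Count how often a cleanup handler runs over `days` simulated days,
// using a 24-tick (24-hour) interval.
int CountCleanups(bool periodic, unsigned days) {
    const unsigned interval = 24;
    int cleanups = 0;
    ToyTimerManager mgr;
    if (periodic) {
        mgr.Register_Timer(interval, interval, [&] { ++cleanups; },
                           "CleanJobQueue");
    } else {
        mgr.Register_Timer(interval, [&] { ++cleanups; }, "CleanJobQueue");
    }
    mgr.Advance(days * interval);
    return cleanups;
}
```

Over five simulated days, CountCleanups(false, 5) comes out to one
cleanup and CountCleanups(true, 5) to five, which is the same gap the
schedd sees between the buggy and fixed registrations.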

Have a nice day,
Mike Yoder
Principal Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com