
[Condor-users] Debugging condor_preen problems



I have a handful of new Windows machines that ran jobs fine for a while
and then, one by one, their VMs stopped accepting new jobs. They just
sit there in the U+I state. Looking at one of the boxes, the disk where
jobs execute appears to be so full that new jobs can't find enough space
to run. These are big disk drives (80GB), so a lot of leftover junk has
accumulated: hundreds of dir_<pid> directories are left in my
d:\abc\condor\execute directory.
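
A quick way to put numbers on the leftover directories is a small script
like the one below. It is only a sketch: the function name
summarize_execute_dir is made up for illustration, and it assumes the
directories are named exactly dir_<pid> as described above.

```python
import os
import re

# Assumption: leftover directories are named exactly dir_<pid>.
DIR_PATTERN = re.compile(r"^dir_(\d+)$")


def summarize_execute_dir(execute_dir):
    """Return a sorted list of (pid, total_bytes) for each dir_<pid> found."""
    results = []
    for entry in os.scandir(execute_dir):
        match = DIR_PATTERN.match(entry.name)
        if not match or not entry.is_dir(follow_symlinks=False):
            continue  # not a dir_<pid> directory, skip it
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file vanished or is unreadable, skip it
        results.append((int(match.group(1)), total))
    return sorted(results)
```

Calling summarize_execute_dir(r"d:\abc\condor\execute") on an affected
machine would show how many directories are left and how much space each
one holds.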

I noticed in my MasterLog on the machine that condor_preen was running,
but only taking about 5 seconds to execute:

8/21 11:51:59 Preen pid is 624
8/21 11:52:04 DaemonCore: Command received via UDP from host <137.57.203.140:3104>
8/21 11:52:04 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
8/21 11:52:04 Child 624 died, but not a daemon -- Ignored
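
To check how long each preen run takes across a whole MasterLog, the
start and exit lines can be paired up by pid. The sketch below assumes
the two line formats shown in the excerpt above; the function name
preen_durations is hypothetical, and the year is a placeholder because
the log timestamps do not include one.

```python
import re
from datetime import datetime

# Assumption: these two line shapes match the MasterLog excerpt above.
START_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) Preen pid is (\d+)")
EXIT_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) Child (\d+) died")


def parse_stamp(stamp):
    # The log has no year; 2000 is a placeholder so only differences matter.
    return datetime.strptime("2000/" + stamp, "%Y/%m/%d %H:%M:%S")


def preen_durations(lines):
    """Map each preen pid to the seconds between its start and exit lines."""
    started = {}
    durations = {}
    for line in lines:
        start = START_RE.match(line)
        if start:
            started[int(start.group(2))] = parse_stamp(start.group(1))
            continue
        exited = EXIT_RE.match(line)
        if exited:
            pid = int(exited.group(2))
            if pid in started:  # ignore children that were not preen
                delta = parse_stamp(exited.group(1)) - started.pop(pid)
                durations[pid] = delta.total_seconds()
    return durations


sample = [
    "8/21 11:51:59 Preen pid is 624",
    "8/21 11:52:04 Child 624 died, but not a daemon -- Ignored",
]
print(preen_durations(sample))
```

For the excerpt above this reports 5.0 seconds for pid 624, matching the
roughly 5 second run time noted in the log.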

So I ran it by hand with '-r -m -v'. It certainly found all the leftover
execute directories in d:\abc\condor\execute, but for every directory it
said OK instead of removing it. My preen settings are:

PREEN_INTERVAL = 1800
PREEN_ARGS = -r -m -v
PREEN_ADMIN = $(CONDOR_ADMIN)
VALID_SPOOL_FILES = job_queue.log, job_queue.log.tmp, history, Accountant.log, Accountantnew.log
INVALID_LOG_FILES = core

Pretty standard. And preen is working on many other Windows machines.
Just not these.

Is there a way to get more information out of condor_preen so I can
figure out why it thinks these hundreds of leftover directories are
okay and is not cleaning them up?

- Ian

--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300