
Re: [Condor-users] How did I get zombies?



> I have a cluster of 6.6.9 on W2k3.  I have several jobs that were running
> and we removed (condor_rm), but after removal they stayed as an 'X' in the
> queue.  An analysis of the queue said they were being removed.  While in
> this state, the nodes they were on were stuck being claimed with idle
> status.  After leaving it a week I did a condor_rm -forcex.  Now that
> removed them from the queue, but the nodes are still claimed.  Looking in
> the schedd log I have this
> 
> Zombie process has not been cleaned up by reaper - pid 1300

This could be a Condor bug.  It looks like the schedd is detecting the
situation - it knows there ought to be a zombie - but it isn't *doing*
anything about it.  Interestingly, code to do something about it *used*
to be there, but is now commented out.
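
If you want to see how many of these the schedd has complained about, a
small script can pull the pids out of the log.  This is only a rough
sketch: the pattern is copied from the log line quoted above, and the
function name find_zombie_pids and the sample lines (including their
timestamps) are made up for illustration, not taken from Condor.

```python
import re

# Pattern copied from the schedd log message quoted in this thread.
ZOMBIE_RE = re.compile(
    r"Zombie process has not been cleaned up by reaper - pid (\d+)"
)


def find_zombie_pids(log_lines):
    """Return the unique pids named in zombie warnings, in first-seen order.

    find_zombie_pids is a hypothetical helper, not part of Condor.
    """
    pids = []
    for line in log_lines:
        match = ZOMBIE_RE.search(line)
        if match:
            pid = int(match.group(1))
            if pid not in pids:
                pids.append(pid)
    return pids


# Made-up sample lines; a real schedd log will have its own prefix format.
sample = [
    "10/5 12:00:00 Zombie process has not been cleaned up by reaper - pid 1300",
    "10/5 12:00:05 some unrelated message",
    "10/5 12:05:00 Zombie process has not been cleaned up by reaper - pid 1300",
]
print(find_zombie_pids(sample))  # prints [1300]
```

To run it against a real log, pass the open file object instead of the
sample list; the function only needs something it can iterate over line
by line.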

> How can I get the nodes unclaimed?   Later I'll try to figure out how I
> got into this problem.

Restart the schedd. 

condor_restart -name <machine> -schedd

Make sure you run it from a machine that has HOSTALLOW_ADMINISTRATOR
privileges on the target machine.

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com