
Re: [Condor-users] unable to remove jobs stuck in X state



Update... 

Deleting job_queue.log (and all clusterX.procX.* directories) from the SPOOL directory on the submit machine and then restarting condor_master clears everything from the queue, although it also resets job numbering and history to 1.
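
For anyone wanting to repeat this, here is a rough sketch of the cleanup step, not a tested recipe. It moves the files into a backup directory instead of deleting them, so the step can be undone. The function name clear_spool and the backup path are made up for illustration, and it assumes the Condor daemons are stopped before it runs and restarted afterwards by whatever means your install uses.

```shell
#!/bin/sh
# Sketch only: clear the job queue state out of a Condor SPOOL directory.
# Assumption: the Condor daemons are stopped before this runs.

# clear_spool SPOOL_DIR BACKUP_DIR   (hypothetical helper name)
# Moves job_queue.log and any clusterX.procX.* entries from SPOOL_DIR
# into BACKUP_DIR rather than deleting them outright.
clear_spool() {
    spool=$1
    backup=$2
    [ -d "$spool" ] || { echo "no such spool dir: $spool" >&2; return 1; }
    mkdir -p "$backup" || return 1
    for entry in "$spool"/job_queue.log "$spool"/cluster*.proc*.*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || continue
        mv "$entry" "$backup"/ || return 1
        echo "moved $entry"
    done
}

# Example call (the backup path is hypothetical):
#   clear_spool "$(condor_config_val SPOOL)" /tmp/spool-backup
```

Everything else in SPOOL is left alone, and moving the entries back restores the old queue file if the result is not what you wanted.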

While this works, it seems like a real sledgehammer tactic for something that can probably be done more selectively. Any ideas?

Steve
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Platt
Sent: 05 August 2011 11:58
To: Condor-Users Mail List
Subject: [Condor-users] unable to remove jobs stuck in X state

Hello,
Here's the thing... A normally successful user has submitted 3 vanilla job clusters (4461, 4462 & 4463), each of ~180 processes. The first two had bad inputs and were condor_rm'd. They are now stuck in the X state, while the 4463.xxx jobs are sitting in Idle. Trying to remove the stuck jobs with -forcex is unsuccessful...

$ condor_rm -debug -forcex 4461
8/5 11:36:21 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:21 IO: Failed to read packet header
8/5 11:36:41 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:41 IO: Failed to read packet header
8/5 11:36:41 AUTHENTICATE: handshake failed!
8/5 11:36:41 DCSchedd: authentication failure: AUTHENTICATE:1002:Failure performing handshake
AUTHENTICATE:1002:Failure performing handshake
Couldn't find/remove all jobs in cluster 4461.

...and analysis of the Idle jobs isn't much clearer...

$ condor_q 4463.1 -better-analyze
-- Quill: quill@xxxxxxxxxxxxxxxxxxxx : <xxx.xxx.147.62:5432> : quill---
4463.001:  Run analysis summary.  Of 49 machines,
      0 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
     48 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

I admit that we have standard network cabling connecting the nodes (1 master, 8 nodes, 48 slots), so it might be crap IO, although this hasn't prevented jobs running over the last couple of years.
Does anyone have any pointers for investigating this?

Thanks

Steve
Health Protection Agency
UK
[Condor 7.0.5 running on Rocks 5.1]