Hello,
Here’s the thing … A normally successful user has submitted 3 vanilla jobs (4461, 4462 & 4463), each of ~180 processes. The first two had bad inputs and were condor_rm’d; they are now stuck in the X (removed) state, while the 4463.xxx jobs sit in Idle. Trying to remove the stuck jobs with -forcex is unsuccessful…
$ condor_rm -debug -forcex 4461
8/5 11:36:21 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:21 IO: Failed to read packet header
8/5 11:36:41 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:41 IO: Failed to read packet header
8/5 11:36:41 AUTHENTICATE: handshake failed!
8/5 11:36:41 DCSchedd: authentication failure: AUTHENTICATE:1002:Failure performing handshake
AUTHENTICATE:1002:Failure performing handshake
Couldn't find/remove all jobs in cluster 4461.
…and analysis of the Idle jobs isn’t much clearer…
$ condor_q 4463.1 -better-analyze
-- Quill: quill@xxxxxxxxxxxxxxxxxxxx : <xxx.xxx.147.62:5432> : quill---
4463.001: Run analysis summary. Of 49 machines,
0 are rejected by your job's requirements
1 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
48 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
I admit that we have standard network cabling connecting the nodes (1 master, 8 nodes, 48 slots), so it might be poor network IO, although this hasn’t prevented jobs from running over the last couple of years.
Does anyone have any pointers for investigating this?
Thanks
Health Protection Agency
UK
[Condor 7.0.5 running on Rocks 5.1]