We had some central managers pointing to a view server host with no problems.
Added another CM (6.6.10 on RHES4) and all was OK (call this condor1).
View server machine died.
condor1 still OK.
Added another CM (6.6.10 on RHES4) and condor_status hangs (call this condor2).
Telnet to port 9618 doesn't exit after 3 <CR>'s (as it should do).
Restarted condor1 and it now misbehaves in the same way as condor2.
Commented out CONDOR_VIEW_HOST in both config files.
/etc/rc.d/init.d/condor stop doesn't work; had to kill -9 all condor processes.
Restarted condor and all is OK again.
Is this to be expected?
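
For anyone wanting to repeat the telnet check on port 9618 without typing it by hand, here is a minimal sketch, assuming Python 3. It connects, sends three carriage returns and waits to see whether the other end closes the connection. The function name `closes_after_returns` and the 5-second default timeout are made up for illustration; they are not part of Condor.

```python
import socket
import time


def closes_after_returns(host, port, timeout=5.0, returns=3):
    """Connect, send a few carriage returns, and report whether the peer
    closes the connection cleanly before the timeout expires.

    Returns True if the peer closed the connection, False if the connect
    failed, the connection was reset, or the peer was still holding the
    socket open when the timeout ran out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"\r\n" * returns)
            deadline = time.monotonic() + timeout
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # peer never closed: the "hang" case
                sock.settimeout(remaining)
                if not sock.recv(4096):
                    return True  # empty read means the peer closed
    except OSError:
        # Covers refused connections, resets and socket timeouts.
        return False
```

Calling `closes_after_returns("condor2", 9618)` would be expected to return True against a healthy collector and False in the hanging state described above (the hostname here is just the nickname used in this post).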
Cheers
Greg
P.S. Excerpts from CollectorLog:
8/16 13:35:50 Can't connect to <130.116.146.129:9618>:0, errno = 110
8/16 13:35:50 Will keep trying for 10 seconds...
8/16 13:35:51 Connect failed for 10 seconds; returning FALSE
8/16 13:35:51 ERROR: SECMAN:2003:TCP connection to <130.116.146.129:9618> failed
8/16 13:35:51 Can't send command 0 to View Collector
8/16 13:35:51 condor_write(): Socket closed when trying to write buffer
8/16 13:35:51 Buf::write(): condor_write() failed
8/16 13:35:51 SECMAN: Error sending response classad!
8/16 13:35:51 IO: Incoming packet is too big
8/16 13:35:51 DaemonCore: Can't receive command request (perhaps a timeout?)
8/16 13:35:51 condor_write(): Socket closed when trying to write buffer
8/16 13:35:51 Buf::write(): condor_write() failed
8/16 13:35:51 SECMAN: Error sending response classad!
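
The "Can't connect" lines in the excerpt above can be pulled out of a longer log with a short script. This is only a sketch, assuming Python 3 and assuming the line format is exactly as shown in the excerpt; the name `failed_connects` and the regular expression are illustrative, not anything Condor provides. For reference, errno 110 on Linux is ETIMEDOUT (connection timed out).

```python
import re

# Matches lines such as:
#   8/16 13:35:50 Can't connect to <130.116.146.129:9618>:0, errno = 110
CONNECT_FAIL = re.compile(
    r"Can't connect to <(?P<host>[\d.]+):(?P<port>\d+)>.*?errno = (?P<errno>\d+)"
)


def failed_connects(lines):
    """Return a (host, port, errno) tuple for each connect-failure line."""
    results = []
    for line in lines:
        match = CONNECT_FAIL.search(line)
        if match:
            results.append(
                (match["host"], int(match["port"]), int(match["errno"]))
            )
    return results
```

It takes any iterable of lines, so an open CollectorLog file object can be passed in directly.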
-----------------------------------------------------------------------
Greg Hitchen greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining phone:+61 8 6436 8663
Australian Resources Research Centre (ARRC) fax: +61 8 6436 8555
Postal address: mob: 0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------