
Re: [Condor-users] CEDAR:6001: Failed to connect to ...



What does your CollectorLog say on the machine where the collector is
running?  It looks like the collector didn't start at all when
Condor tried to restart.

Steve Timm


On Fri, 11 Apr 2008, Thomas Laue wrote:

Hello,

I am running a condor pool with 3 servers with Windows Server 2003 and
Condor 6.8.5.

Apparently, every once in a while the Condor master seems to "fail". By
this I mean that the submitting nodes and processing nodes no longer see
the available resources. If I then look at the Condor master (which is,
by the way, on a separate server) and run condor_status, I get the
message that the condor_collector is not running:

"CEDAR:6001: Failed to connect to <xxx.x.x.xx:9618>, Error: Couldn't
contact the condor_collector on xxx.xxxx.xx.xx."
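
A quick way to confirm from another machine whether anything is
listening on the collector port is a plain TCP connection attempt. The
following is a minimal Python sketch and is not part of Condor; the
function name collector_reachable is hypothetical, the host is a
placeholder, and 9618 is simply the port shown in the error above.

```python
import socket


def collector_reachable(host, port=9618, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: it only checks that something accepts
    connections on the port, not that it is a healthy collector.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (WinSock error 10061), timeouts
        # and name-resolution failures alike.
        return False
```

If this returns False while the Condor service is reported as running,
that would be consistent with the collector process not having been
started, rather than with a network problem between the nodes.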

I know I can fix this by rebooting the machine, but since it is also the
floating license server for five different software packages, I can't
reboot it without stopping all processing.

Can you tell me if there is another way of resetting the condor master
machine, without a reboot?
I already tried condor_restart and also stopping and restarting the
condor_master service (via the MS management console Services).  Also
condor_off doesn't work because the condor_collector is not running at
this point (I think).

The MasterLog file contains the following messages:

3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting
C:\condor/bin/condor_master.exe.
3/30 03:04:51 Sent signal 15 to COLLECTOR (pid 1228)
3/30 03:04:51 Sent signal 15 to NEGOTIATOR (pid 8068)
3/30 03:04:51 Sent signal 15 to SCHEDD (pid 7324)
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3638>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The NEGOTIATOR (pid 8068) exited with status 0
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3639>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The COLLECTOR (pid 1228) exited with status 0
3/30 03:04:52 DaemonCore: Command received via UDP from host
<10.40.10.150:3640>
3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
3/30 03:04:52 The SCHEDD (pid 7324) exited with status 0
3/30 03:04:52 All daemons are gone.  Restarting.
3/30 03:04:52 Restarting master in 120 seconds.
3/30 03:06:52 Running as NT Service = 1
3/30 03:06:52 Doing exec( "C:\WINDOWS\system32\cmd.exe /Q /C net stop
Condor & net start Condor" )
3/30 03:06:53 ******************************************************
3/30 03:06:53 ** Condor (CONDOR_MASTER) STARTING UP
3/30 03:06:53 ** C:\condor\bin\condor_master.exe
3/30 03:06:53 ** $CondorVersion: 6.8.5 May 17 2007 $
3/30 03:06:53 ** $CondorPlatform: INTEL-WINNT50 $
3/30 03:06:53 ** PID = 7700
3/30 03:06:53 ** Log last touched 3/30 03:06:52
3/30 03:06:53 ******************************************************
3/30 03:06:53 Using config source: C:\condor\condor_config
3/30 03:06:53 Using local config sources:
3/30 03:06:53    C:\condor/condor_config.local
3/30 03:06:53 DaemonCore: Command Socket at < xxx.x.x.xx:xxxx >
3/30 03:16:58 WinFirewall: get_CurrentProfile failed: 0x800706d9
3/30 03:16:58 Started DaemonCore process
"C:\condor/bin/condor_schedd.exe", pid and pgroup = 7912
3/30 03:17:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
errno = 10061 connection refused.
3/30 03:17:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
failed

3/30 03:17:04 Failed to start non-blocking update to < xxx.x.x.xx:xxxx >.
3/30 03:22:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
errno = 10061 connection refused.
3/30 03:22:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
failed
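
To see at a glance which daemons the master actually launched after its
most recent restart, the log can be scanned for "Started DaemonCore
process" entries following the last "STARTING UP" banner. This is only a
sketch under assumptions: the function name is hypothetical, and the
line format is inferred from the excerpt above (timestamped lines, with
untimestamped lines treated as wrapped continuations).

```python
import re

# Lines that begin with "M/D HH:MM:SS " start a new log entry; any other
# non-empty line is assumed to be a wrapped continuation of the previous one.
STAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")
STARTED = re.compile(r'Started DaemonCore process "[^"]*condor_(\w+)\.exe"')


def daemons_started_after_last_restart(log_text):
    """Return daemon names (upper case) started after the last startup banner."""
    entries = []
    for line in log_text.splitlines():
        text = line.strip()
        if not text:
            continue
        if STAMP.match(text) or not entries:
            entries.append(text)
        else:
            entries[-1] += " " + text  # re-join a wrapped line

    # Index of the most recent "STARTING UP" banner (0 if there is none).
    start = 0
    for i, entry in enumerate(entries):
        if "STARTING UP" in entry:
            start = i

    started = []
    for entry in entries[start:]:
        match = STARTED.search(entry)
        if match:
            started.append(match.group(1).upper())
    return started
```

Applied to the excerpt above, the only match after the banner is the
SCHEDD; no COLLECTOR or NEGOTIATOR start is logged, which fits the
"connection refused" errors on port 9618.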

Could anyone tell me what might be the reason for this problem and how
to fix it? Thank you very much in advance!

Thomas


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.