
Re: [Condor-users] CEDAR:6001: Failed to connect to ...



Hi,

The last entry in the CollectorLog 

> 3/30 03:04:51 Got SIGTERM. Performing graceful shutdown.
> 3/30 03:04:51 **** condor_collector.exe (condor_COLLECTOR) EXITING WITH 
> STATUS 0

is dated 3/30 03:04:51. At exactly the same time, the MasterLog file contains the following entry:

> 3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting
> C:\condor/bin/condor_master.exe.

I found a post on this mailing list (https://lists.cs.wisc.edu/archive/condor-users/2006-March/msg00377.shtml) that deals with Windows NTFS issues after time zone changes. Could it be that this problem has still not been fixed?
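
To check whether such restarts recur in a MasterLog, the log can be scanned for "was modified, restarting" entries. The following is only a minimal sketch: the regular expression is derived from the two log lines quoted in this thread, and the function name find_modified_restarts is made up for illustration, not part of Condor.

```python
import re

# Assumption: the line layout matches the MasterLog excerpt quoted in this
# thread, e.g.
#   3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting
MODIFIED_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<path>.+?) was modified, restarting"
)


def find_modified_restarts(lines):
    """Return (date, time, path) for each 'was modified, restarting' entry."""
    hits = []
    for line in lines:
        # Drop any leading "> " quote markers and trailing whitespace.
        text = line.lstrip("> ").rstrip()
        match = MODIFIED_RE.match(text)
        if match:
            hits.append(
                (match.group("date"), match.group("time"), match.group("path"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        r"3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting",
        r"3/30 03:04:51 Sent signal 15 to COLLECTOR (pid 1228)",
    ]
    for date, time, path in find_modified_restarts(sample):
        print(date, time, path)
```

Comparing the dates reported this way against the dates of time zone or daylight saving changes on the machine would show whether the restarts coincide with them.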

Best regards

Thomas Laue
 
-- 
INPHO GmbH   *   Smaragdweg 1   *   70174 Stuttgart   *   Germany
phone: +49 711 2288 10  *  fax: +49 711 2288 111  *  web: www.inpho.de
place of business: Stuttgart    *   managing director: Johannes Saile
commercial register: Stuttgart, HRB 9586
Leader in Photogrammetry and Digital Surface Modelling
Please visit www.inpho.de 

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Timm
Sent: Friday, 11 April 2008 16:38
To: Condor-Users Mail List
Subject: Re: [Condor-users] CEDAR:6001: Failed to connect to ...

What does your CollectorLog say on the machine where the collector is
running?  It looks like the Collector didn't start at all when
Condor tried to restart.
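
A quick way to confirm that nothing is listening on the collector port is a plain TCP connection attempt. The sketch below is only an illustration: port 9618 is taken from the error message quoted below, and the helper name collector_reachable is hypothetical. A refused connection (errno 10061 on Windows) makes it return False.

```python
import socket


def collector_reachable(host, port=9618, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the collector machine's address.
    print(collector_reachable("localhost"))
```

This only tests whether the port accepts connections; it says nothing about whether the daemon behind it is actually the condor_collector.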

Steve Timm


On Fri, 11 Apr 2008, Thomas Laue wrote:

> Hello,
>
> I am running a condor pool with 3 servers with Windows Server 2003 and
> Condor 6.8.5.
>
> Apparently, every once in a while the condor master seems to "fail". By
> this I mean that the Condor submit nodes and processing nodes no longer
> see the available resources. If I then look at the condor master (which
> is, by the way, on a separate server) and run condor_status, I get the
> message that the condor_collector is not running.
>
> "CEDAR:6001"Failed to connect to <xxx.x.x.xx:9618>, Error: Couldn't
> contact the condor_collector on xxx.xxxx.xx.xx.
>
> I know I can fix this by rebooting the machine, but since it is also the
> floating license server for five different software packages, I can't
> reboot it without stopping all processing.
>
> Can you tell me if there is another way of resetting the condor master
> machine, without a reboot?
> I already tried condor_restart and also stopping and restarting the
> condor_master service (via the MS management console Services).  Also
> condor_off doesn't work because the condor_collector is not running at
> this point (I think).
>
> The MasterLog file contains the following messages:
>
> 3/30 03:04:51 C:\condor/bin/condor_master.exe was modified, restarting
> C:\condor/bin/condor_master.exe.
> 3/30 03:04:51 Sent signal 15 to COLLECTOR (pid 1228)
> 3/30 03:04:51 Sent signal 15 to NEGOTIATOR (pid 8068)
> 3/30 03:04:51 Sent signal 15 to SCHEDD (pid 7324)
> 3/30 03:04:52 DaemonCore: Command received via UDP from host
> <10.40.10.150:3638>
> 3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
> handler (handle_nop())
> 3/30 03:04:52 The NEGOTIATOR (pid 8068) exited with status 0
> 3/30 03:04:52 DaemonCore: Command received via UDP from host
> <10.40.10.150:3639>
> 3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
> handler (handle_nop())
> 3/30 03:04:52 The COLLECTOR (pid 1228) exited with status 0
> 3/30 03:04:52 DaemonCore: Command received via UDP from host
> <10.40.10.150:3640>
> 3/30 03:04:52 DaemonCore: received command 60011 (DC_NOP), calling
> handler (handle_nop())
> 3/30 03:04:52 The SCHEDD (pid 7324) exited with status 0
> 3/30 03:04:52 All daemons are gone.  Restarting.
> 3/30 03:04:52 Restarting master in 120 seconds.
> 3/30 03:06:52 Running as NT Service = 1
> 3/30 03:06:52 Doing exec( "C:\WINDOWS\system32\cmd.exe /Q /C net stop
> Condor & net start Condor" )
> 3/30 03:06:53 ******************************************************
> 3/30 03:06:53 ** Condor (CONDOR_MASTER) STARTING UP
> 3/30 03:06:53 ** C:\condor\bin\condor_master.exe
> 3/30 03:06:53 ** $CondorVersion: 6.8.5 May 17 2007 $
> 3/30 03:06:53 ** $CondorPlatform: INTEL-WINNT50 $
> 3/30 03:06:53 ** PID = 7700
> 3/30 03:06:53 ** Log last touched 3/30 03:06:52
> 3/30 03:06:53 ******************************************************
> 3/30 03:06:53 Using config source: C:\condor\condor_config
> 3/30 03:06:53 Using local config sources:
> 3/30 03:06:53    C:\condor/condor_config.local
> 3/30 03:06:53 DaemonCore: Command Socket at < xxx.x.x.xx:xxxx >
> 3/30 03:16:58 WinFirewall: get_CurrentProfile failed: 0x800706d9
> 3/30 03:16:58 Started DaemonCore process
> "C:\condor/bin/condor_schedd.exe", pid and pgroup = 7912
> 3/30 03:17:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
> errno = 10061 connection refused.
> 3/30 03:17:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
> failed
>
> 3/30 03:17:04 Failed to start non-blocking update to < xxx.x.x.xx:xxxx >.
> 3/30 03:22:04 attempt to connect to < xxx.x.x.xx:9618> failed: connect
> errno = 10061 connection refused.
> 3/30 03:22:04 ERROR: SECMAN:2003:TCP connection to < xxx.x.x.xx:xxxx >
> failed
>
> Could anyone tell me what might be causing this problem and how to
> fix it? Thank you very much in advance!
>
> Thomas
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>

-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.