Hello Ben,
BB> Hi Pavel:
BB> >>> For the first machine: when I start the Condor service, two processes are started, condor_master and condor_startd. But 10-15 seconds after startup, condor_startd dies and condor_master starts consuming 50% of the CPU. After that I can't stop the Condor service: when I try, I get an error message saying the service could not be stopped because the response time was exceeded. Note that condor_status on the central manager doesn't show this machine in the list, either while the service is "running" or after my attempt to stop it.
BB> <<<
BB> Can you post the master and startd logs? (Preferably with debugging turned up to, say, D_FULLDEBUG.) Also, when the startd dies, does it leave a core file behind? If so, please post that too.
Here they are.
MasterLog:
========
3/17 12:16:22 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203
3/17 12:16:22 WARNING: Config source is empty: C:\condor/condor_config.local
3/17 12:16:22 ******************************************************
3/17 12:16:22 ** Condor (CONDOR_MASTER) STARTING UP
3/17 12:16:22 ** C:\condor\bin\condor_master.exe
3/17 12:16:22 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
3/17 12:16:22 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
3/17 12:16:22 ** $CondorVersion: 7.2.1 Feb 19 2009 BuildID: 133382 $
3/17 12:16:22 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:16:22 ** PID = 3488
3/17 12:16:22 ** Log last touched time unavailable (No such file or directory)
3/17 12:16:22 ******************************************************
3/17 12:16:22 Using config source: C:\condor\condor_config
3/17 12:16:22 Using local config sources:
3/17 12:16:22 C:\condor/condor_config.local
3/17 12:16:22 DaemonCore: Command Socket at <195.209.147.39:1033>
3/17 12:16:22 Will use UDP to update collector n37.keldysh.ru <195.209.147.37:9618>
3/17 12:16:22 Log file not found in config file: AGENTD_LOG
3/17 12:16:22 Authorized application C:\condor/bin/condor_startd.exe is now enabled in the firewall.
3/17 12:16:22 Authorized application C:\condor/bin\condor_dagman.exe is now enabled in the firewall.
3/17 12:16:22 ::RealStart; STARTD >
3/17 12:16:22 GetBinaryType() returned 0
3/17 12:16:22 Started DaemonCore process "C:\condor/bin/condor_startd.exe", pid and pgroup = 3692
3/17 12:16:22 ::RealStart; AGENTD >
3/17 12:16:22 GetBinaryType() returned 0
3/17 12:16:22 Started process "C:\condor/agentd/agentd.exe", pid and pgroup = 3716
3/17 12:16:22 Getting monitoring info for pid 3488
3/17 12:16:27 enter Daemons::UpdateCollector
3/17 12:16:27 Trying to update collector <195.209.147.37:9618>
3/17 12:16:27 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618>
3/17 12:16:27 File descriptor limits: max 1024, safe 820
3/17 12:16:27 exit Daemons::UpdateCollector
3/17 12:16:27 enter Daemons::CheckForNewExecutable
3/17 12:16:27 Time stamp of running C:\condor/bin/condor_master.exe: 1234996426
3/17 12:16:27 GetTimeStamp returned: 1234996426
3/17 12:16:27 Time stamp of running C:\condor/bin/condor_startd.exe: 1234996492
3/17 12:16:27 GetTimeStamp returned: 1234996492
3/17 12:16:27 Time stamp of running C:\condor/agentd/agentd.exe: 1233239202
3/17 12:16:27 GetTimeStamp returned: 1233239202
3/17 12:16:27 exit Daemons::CheckForNewExecutable
3/17 12:16:27 Initialized the following authorization table:
3/17 12:16:27 Authorizations yet to be resolved:
3/17 12:16:27 allow NEGOTIATOR: */195.209.147.37 */n37.keldysh.ru
3/17 12:16:27 allow ADMINISTRATOR: */195.209.147.37 */n37.keldysh.ru
3/17 12:16:27 allow OWNER: */n39.keldysh.ru */195.209.147.37 */n37.keldysh.ru */195.209.147.39
3/17 12:16:33 The STARTD (pid 3692) exited with status 0
3/17 12:16:33 ProcAPI: pid # 3692 was not found (OpenProcess err=720)
3/17 12:16:33 ProcAPI: pid # 3692 was not found (OpenProcess err=720)
3/17 12:16:33 restarting C:\condor/bin/condor_startd.exe in 10 seconds
3/17 12:16:33 enter Daemons::UpdateCollector
3/17 12:16:33 Trying to update collector <195.209.147.37:9618>
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618>
========
StartLog:
========
3/17 12:16:22 WARNING: Config source is empty: C:\condor/condor_config.local
3/17 12:16:22 ******************************************************
3/17 12:16:22 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:16:22 ** C:\condor\bin\condor_startd.exe
3/17 12:16:22 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
3/17 12:16:22 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
3/17 12:16:22 ** $CondorVersion: 7.2.1 Feb 19 2009 BuildID: 133382 $
3/17 12:16:22 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:16:22 ** PID = 3692
3/17 12:16:22 ** Log last touched time unavailable (No such file or directory)
3/17 12:16:22 ******************************************************
3/17 12:16:22 Using config source: C:\condor\condor_config
3/17 12:16:22 Using local config sources:
3/17 12:16:22 C:\condor/condor_config.local
3/17 12:16:22 DaemonCore: Command Socket at
3/17 12:16:22 Will use UDP to update collector n37.keldysh.ru <195.209.147.37:9618>
3/17 12:16:22 Memory: Detected 3574 megs RAM
3/17 12:16:22 doInitialize() failed for
3/17 12:16:22 No usable network interface: hibernation disabled
3/17 12:16:23 my_popen: CreateProcess failed
3/17 12:16:23 Failed to execute C:\condor/bin/condor_starter.std.exe, ignoring
3/17 12:16:23 command_x_event() called.
3/17 12:16:23 slot1: New machine resource allocated
3/17 12:16:23 slot2: New machine resource allocated
3/17 12:16:23 Instantiating a StartdHookMgr
3/17 12:16:23 UidDomain = "n39.keldysh.ru"
3/17 12:16:23 FileSystemDomain = "n39.keldysh.ru"
3/17 12:16:23 Swap space: 4194303
3/17 12:16:28 no loadavg samples this minute, maybe thread died???
3/17 12:16:28 slot1: Total execute space: 32051372
3/17 12:16:28 slot2: Total execute space: 32051372
3/17 12:16:28 About to run initial benchmarks.
3/17 12:16:28 About to compute mips
3/17 12:16:28 Computed mips: 7297
3/17 12:16:28 About to compute kflops
3/17 12:16:33 Computed kflops: 1629489
3/17 12:16:33 recalc:DHRY_MIPS=7297, CLINPACK KFLOPS=1629489
3/17 12:16:33 Completed initial benchmarks.
3/17 12:16:33 CronMgr: Constructing 'startd'
3/17 12:16:33 CronMgr: Setting name to 'startd'
3/17 12:16:33 CronMgr: Setting parameter base to 'startd'
3/17 12:16:33 CronMgr: Doing config (initial)
3/17 12:16:33 command_x_event() called.
3/17 12:16:33 slot2: State change: IS_OWNER is false
3/17 12:16:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:16:33 slot1: State change: IS_OWNER is false
3/17 12:16:33 slot1: Changing state: Owner -> Unclaimed
3/17 12:16:33 ERROR "select, error # = 10038" at line 2719 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
3/17 12:16:33 CronMgr: 0 jobs alive
3/17 12:16:33 Deleting Cronmgr
3/17 12:16:33 StartdCronMgr: Shutting down
3/17 12:16:33 CronMgr: Killing all jobs
3/17 12:16:33 StartdCronMgr: Bye
3/17 12:16:33 CronMgr: bye
3/17 12:16:33 About to send final update to the central manager
3/17 12:16:33 Trying to update collector <195.209.147.37:9618>
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618>
3/17 12:16:33 Initialized the following authorization table:
3/17 12:16:33 Authorizations yet to be resolved:
3/17 12:16:33 allow READ: */*
3/17 12:16:33 allow WRITE: */*
3/17 12:16:33 allow NEGOTIATOR: */195.209.147.37 */n37.keldysh.ru
3/17 12:16:33 allow ADMINISTRATOR: */195.209.147.37 */n37.keldysh.ru
3/17 12:16:33 allow OWNER: */n39.keldysh.ru */195.209.147.37 */n37.keldysh.ru */195.209.147.39
3/17 12:16:33 allow DAEMON: */*
3/17 12:16:33 allow ADVERTISE_STARTD: */*
3/17 12:16:33 allow ADVERTISE_SCHEDD: */*
3/17 12:16:33 allow ADVERTISE_MASTER: */*
3/17 12:16:33 Trying to update collector <195.209.147.37:9618>
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618>
3/17 12:16:33 Deleting the StartdHookMgr
3/17 12:16:33 All resources are free, exiting.
3/17 12:16:33 **** condor_startd.exe (condor_STARTD) pid 3692 EXITING WITH STATUS 0
========
When the startd dies, ".startd_address", ".startd_claim_id.slot1" and ".startd_claim_id.slot2" files disappear.
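To make the suspicious entries easier to spot, here is a minimal sketch of a script that picks them out of a log in this format. The list of markers is only my own guess from reading the logs by eye, not an official Condor list, and the function name is made up for illustration. In the StartLog above it would flag, among others, the "select, error # = 10038" line (10038 is the Winsock code WSAENOTSOCK, "socket operation on non-socket").

```python
import re

# Markers that looked like failures to me in the logs above.
# Assumption: this is an ad hoc list, not anything defined by Condor.
SUSPICIOUS = ("ERROR", "WARNING", "failed", "Failed")

# Lines in these logs start with "M/D HH:MM:SS " followed by the message.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def suspicious_lines(log_text):
    """Return (date, time, message) tuples for lines containing a marker."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group(3) for marker in SUSPICIOUS):
            hits.append(match.groups())
    return hits


# Three lines copied from the StartLog above, used as a small sample.
SAMPLE = r'''3/17 12:16:22 Memory: Detected 3574 megs RAM
3/17 12:16:23 my_popen: CreateProcess failed
3/17 12:16:33 ERROR "select, error # = 10038" at line 2719 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
'''

if __name__ == "__main__":
    for date, time, message in suspicious_lines(SAMPLE):
        print(date, time, message)
```

Running it on the whole StartLog instead of the sample only requires reading the file into a string and passing it to suspicious_lines().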
BB> Is there something you are doing in your configuration file that is different than the other machines?
BB> Regards, -B
BB>
There is one difference: the Java location. I also found this message in the StarterLog: "3/17 12:16:23 JavaDetect: failure status 1 when executing C:\PROGRA~1\Java\jre6\bin\JAVA.EXE -Xmx1024m1787m -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. CondorJavaInfo old 2".
condor_config contains "JAVA = C:\PROGRA~1\Java\jre6\bin\JAVA.EXE". My Java is located in "C:\Program Files\Java\jre6\bin". I tried changing the original "JAVA = C:\PROGRA~1\Java\jre6\bin\JAVA.EXE" in condor_config to "C:\Program Files\Java\jre6\bin", but without success.
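One thing worth checking is whether each of the two values actually points at the Java executable or only at its directory. The original setting names JAVA.EXE itself, so I assume (not verified) that Condor expects the full path to the interpreter. A small sketch for checking both values on this machine follows; the function name is hypothetical.

```python
import os


def classify_java_setting(path):
    """Report whether `path` is an existing file, a directory, or missing."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "missing"


if __name__ == "__main__":
    # The two values mentioned above: the original one and the one I tried.
    candidates = [
        r"C:\PROGRA~1\Java\jre6\bin\JAVA.EXE",
        r"C:\Program Files\Java\jre6\bin",
    ]
    for candidate in candidates:
        print(candidate, "->", classify_java_setting(candidate))
```

If the second value is reported as "directory", the setting probably needs the executable name appended, again assuming JAVA must name the interpreter itself.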
Thanks for the response.
--
Pavel