
[Condor-users] STARTD died due to exception ACCESS_VIOLATION



condor -version
$CondorVersion: 6.8.3 Jan  5 2007 $
$CondorPlatform: INTEL-WINNT50 $

A quad-core machine that had been happily running Condor and submitted
jobs started to fail this morning. The MasterLog, StartLog, and core
file are below.

The only changes I made to the machine were to install the Microsoft .NET
Framework v2.0 and to try running an MPI client (the FAH SMP Beta Client,
http://folding.stanford.edu/English/FAQ-SMP#ntoc2).

MasterLog:
12/19 14:05:03 ** Condor (CONDOR_MASTER) STARTING UP
12/19 14:05:03 ** Z:\condor\bin\condor_master.exe
12/19 14:05:03 ** $CondorVersion: 6.8.3 Jan  5 2007 $
12/19 14:05:03 ** $CondorPlatform: INTEL-WINNT50 $
12/19 14:05:03 ** PID = 2192
12/19 14:05:03 ** Log last touched 12/19 14:02:29
12/19 14:05:03 ******************************************************
12/19 14:05:03 Using config source: Z:\condor\condor_config
12/19 14:05:03 Using local config sources: 
12/19 14:05:03    Z:/Condor/condor_config.local
12/19 14:05:03 DaemonCore: Command Socket at <136.200.32.87:2043>
12/19 14:05:03 Started DaemonCore process "Z:/Condor/condor-6.8.3/bin/condor_startd.exe", pid and pgroup = 4036
12/19 14:05:03 Started DaemonCore process "Z:/Condor/condor-6.8.3/bin/condor_schedd.exe", pid and pgroup = 4068
12/19 14:07:07 DaemonCore: Command received via UDP from host <136.200.32.87:2069>
12/19 14:07:07 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/19 14:07:07 The STARTD (pid 4036) died due to exception ACCESS_VIOLATION
12/19 14:07:07 Sending obituary for "Z:/Condor/condor-6.8.3/bin/condor_startd.exe"
12/19 14:07:07 restarting Z:/Condor/condor-6.8.3/bin/condor_startd.exe in 10 seconds
12/19 14:07:17 Started DaemonCore process "Z:/Condor/condor-6.8.3/bin/condor_startd.exe", pid and pgroup = 3196
12/19 14:09:28 DaemonCore: Command received via UDP from host <136.200.32.87:2208>
12/19 14:09:28 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/19 14:09:28 The STARTD (pid 3196) died due to exception ACCESS_VIOLATION
12/19 14:09:28 Sending obituary for "Z:/Condor/condor-6.8.3/bin/condor_startd.exe"
12/19 14:09:28 restarting Z:/Condor/condor-6.8.3/bin/condor_startd.exe in 11 seconds
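
For anyone who wants to pull the crash entries out of a longer MasterLog, here is a rough Python sketch. It is not part of Condor; the function name find_daemon_deaths and the regular expression are my own assumptions, based only on the "died due to exception" lines shown above, and it expects each log entry on a single line as in the file on disk.

```python
import re

# Assumed pattern, modelled on MasterLog lines such as:
#   12/19 14:07:07 The STARTD (pid 4036) died due to exception ACCESS_VIOLATION
DEATH_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) The (\w+) \(pid (\d+)\) died due to exception (\w+)"
)


def find_daemon_deaths(lines):
    """Return (timestamp, daemon, pid, exception) for each daemon death line."""
    deaths = []
    for line in lines:
        match = DEATH_RE.match(line.strip())
        if match:
            timestamp, daemon, pid, exception = match.groups()
            deaths.append((timestamp, daemon, int(pid), exception))
    return deaths


# Sample entries copied from the MasterLog excerpt above.
SAMPLE = [
    "12/19 14:07:07 The STARTD (pid 4036) died due to exception ACCESS_VIOLATION",
    "12/19 14:07:07 restarting Z:/Condor/condor-6.8.3/bin/condor_startd.exe in 10 seconds",
    "12/19 14:09:28 The STARTD (pid 3196) died due to exception ACCESS_VIOLATION",
]

for death in find_daemon_deaths(SAMPLE):
    print(death)
```

To run it against a real log, pass an open file object instead of SAMPLE, for example find_daemon_deaths(open(path)) where path points at your MasterLog.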

StartLog:
12/19 14:20:38 vm2: State change: IS_OWNER is false
12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed
12/19 14:20:38 vm2: State change: IS_OWNER is TRUE
12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner
12/19 14:20:38 vm2: State change: IS_OWNER is false
12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed
12/19 14:20:38 vm2: State change: IS_OWNER is TRUE
12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner
12/19 14:20:38 vm2: State change: IS_OWNER is false
12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed
12/19 14:20:38 vm2: State change: IS_OWNER is TRUE
12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner
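
In the StartLog excerpt above, vm2 switches between Owner and Unclaimed six times within the same second. A rough Python sketch for spotting that kind of flapping in a longer StartLog follows. It is not a Condor tool; the names count_state_changes and flapping, the regular expression, and the threshold of 3 changes per second are all assumptions of mine, chosen only to fit the lines shown here.

```python
import re
from collections import Counter

# Assumed pattern, modelled on StartLog lines such as:
#   12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed
CHANGE_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) (\w+): Changing state: (\w+) -> (\w+)"
)


def count_state_changes(lines):
    """Count 'Changing state' entries per (timestamp, vm) pair."""
    counts = Counter()
    for line in lines:
        match = CHANGE_RE.match(line.strip())
        if match:
            timestamp, vm, _old_state, _new_state = match.groups()
            counts[(timestamp, vm)] += 1
    return counts


def flapping(lines, threshold=3):
    """Return the (timestamp, vm) pairs with at least `threshold` changes.

    The default threshold is an arbitrary, hypothetical cut-off.
    """
    return {
        key: n for key, n in count_state_changes(lines).items() if n >= threshold
    }


# Sample entries copied from the StartLog excerpt above.
SAMPLE = [
    "12/19 14:20:38 vm2: State change: IS_OWNER is false",
    "12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed",
    "12/19 14:20:38 vm2: State change: IS_OWNER is TRUE",
    "12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner",
    "12/19 14:20:38 vm2: State change: IS_OWNER is false",
    "12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed",
    "12/19 14:20:38 vm2: State change: IS_OWNER is TRUE",
    "12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner",
    "12/19 14:20:38 vm2: State change: IS_OWNER is false",
    "12/19 14:20:38 vm2: Changing state: Owner -> Unclaimed",
    "12/19 14:20:38 vm2: State change: IS_OWNER is TRUE",
    "12/19 14:20:38 vm2: Changing state: Unclaimed -> Owner",
]

print(flapping(SAMPLE))
```

The script only counts transitions; it says nothing about why IS_OWNER keeps changing its value.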

core.STARTD.WIN32:
//=====================================================
Exception code: C00000FD STACK_OVERFLOW
Fault address:  7C809C0B 01:00008C0B C:\WINDOWS\system32\kernel32.dll

Registers:
EAX:00003F15
EBX:00000000
ECX:00000000
EDX:FFFFFFFF
ESI:00000100
EDI:00000000
CS:EIP:001B:7C809C0B
SS:ESP:0023:00032FE4  EBP:00033008
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010206