[Condor-users] Windows condor_quill access violation exception errors
- Date: Thu, 29 Oct 2009 10:01:48 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: [Condor-users] Windows condor_quill access violation exception errors
Hi All,

We have recently added the Quill database system/setup to our pools of Condor machines.
Quick summary:
5 pools each with ~600 Windows PCs (mainly XP) running Condor version 7.2.4
5 central managers, 1 Condorview server and 1 Quill database server, each a VM on ESX servers and running 64-bit (x86_64) SLES10 with Condor version 7.2.3
The setup and install went mostly OK, and we have been going for only a week with ~16 Windows submit nodes, all running condor_quill. So far 3 of these machines have started to have access violation errors with condor_quill. We get emails with the header:
[Condor] Problem PI-SCHAP2-SL.nexus.csiro.au: condor_quill.exe died (-1073741819)
See below for excerpts from MasterLog, QuillLog and the core file core.WIN32.QUILL
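For readers matching the emailed exit code against the core file: the negative number is the same 32-bit value as the exception code, only printed as a signed integer. A minimal Python sketch of the conversion (the function name is illustrative and not part of Condor):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as unsigned hex."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit representation.
    return f"{code & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    # The value reported in the "condor_quill.exe died" email.
    print(exit_code_to_hex(-1073741819))  # C0000005
```

C0000005 is the code the core file labels ACCESS_VIOLATION, so the email and the core file describe the same crash.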
Deleting and recopying just the condor_quill.exe file made no difference. In each case an uninstall and reinstall seemed to fix things up. Has anyone else come across this? I'm worried that it just seems to have randomly started happening for no apparent reason.

On top of this we are also having some machines where Condor stops altogether, giving emails:
[Condor] Problem ELEMENT-KB.arrc.csiro.au: condor_quill.exe exited (44)
The MasterLog shows the quill daemon and the schedd exiting, then failures running condor_mail.exe, and condor_schedd.exe reported as not a valid Windows executable:
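One quick way to see whether a binary on disk has been truncated or overwritten is to check its basic PE headers. The sketch below is only a sanity check under that assumption: it tests for the "MZ" magic and the "PE\0\0" signature, and a file that passes may still fail to load for other reasons. The function name is illustrative.

```python
import struct
import sys


def looks_like_pe(path: str) -> bool:
    """Return True if the file has the basic DOS and PE header signatures."""
    with open(path, "rb") as f:
        dos_header = f.read(64)
        # A DOS header is 64 bytes long and starts with the magic "MZ".
        if len(dos_header) < 64 or dos_header[:2] != b"MZ":
            return False
        # Offset 0x3C holds the little-endian file offset of the PE header.
        (pe_offset,) = struct.unpack_from("<I", dos_header, 0x3C)
        f.seek(pe_offset)
        return f.read(4) == b"PE\0\0"


if __name__ == "__main__":
    # Usage: pass one or more executable paths on the command line.
    for candidate in sys.argv[1:]:
        print(candidate, "OK" if looks_like_pe(candidate) else "NOT a valid PE header")
```

Run against the files in the Condor bin directory, this would flag zero-length or obviously corrupted executables without needing to start them.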
MasterLog for exit code 44 and Condor stopping, with no daemons running at all:
10/29 04:56:08 DaemonCore: pid 3144 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 04:56:08 The QUILL (pid 3144) exited with status 44
10/29 04:56:08 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 3600 seconds
10/29 04:56:08 DaemonCore: return from reaper for pid 3144
10/29 04:56:14 Received UDP command 60008 (DC_CHILDALIVE) from <130.116.144.59:9738>, access level DAEMON
10/29 04:56:14 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 04:56:14 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.016s)
10/29 05:15:35 Received UDP command 60008 (DC_CHILDALIVE) from <130.116.144.59:9743>, access level DAEMON
10/29 05:15:35 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:35 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:15:44 Received UDP command 60008 (DC_CHILDALIVE) from <130.116.144.59:9340>, access level DAEMON
10/29 05:15:44 Calling HandleReq <HandleChildAliveCommand> (0)
10/29 05:15:44 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.000s)
10/29 05:17:09 Received UDP command 60011 (DC_NOP) from <130.116.144.59:9221>, access level READ
10/29 05:17:09 Calling HandleReq <handle_nop()> (0)
10/29 05:17:09 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.016s)
10/29 05:17:09 DaemonCore: pid 3632 exited with status 44, invoking reaper 1 <Daemons::DefaultReaper()>
10/29 05:17:09 The SCHEDD (pid 3632) exited with status 44
10/29 05:17:09 cannot send softkill since WINDOWS_SOFTKILL is undefined
10/29 05:17:09 Sending obituary for "C:\PROGRA~1\condor/bin/condor_schedd.exe"
10/29 05:17:09 my_popen: CreateProcess failed
10/29 05:17:09 Failed to access email program "C:\PROGRA~1\condor/bin/condor_mail.exe"
10/29 05:17:09 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 10 seconds
10/29 05:17:09 DaemonCore: return from reaper for pid 3632
10/29 05:17:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 11 seconds
10/29 05:17:30 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:30 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:30 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 13 seconds
10/29 05:17:43 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:17:43 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:17:43 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 17 seconds
10/29 05:18:00 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:00 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:00 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 25 seconds
10/29 05:18:25 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:18:25 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:18:25 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 41 seconds
10/29 05:19:06 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:19:06 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:19:06 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 73 seconds
10/29 05:20:19 ERROR: C:\PROGRA~1\condor/bin/condor_schedd.exe is not a valid Windows executable
10/29 05:20:19 ERROR: Create_Process failed trying to start C:\PROGRA~1\condor/bin/condor_schedd.exe
10/29 05:20:19 restarting C:\PROGRA~1\condor/bin/condor_schedd.exe in 137 seconds
10/29 09:53:30 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203
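The schedd restart delays in the log above (10, 11, 13, 17, 25, 41, 73, 137 seconds) fit a constant plus a power of two. A small Python sketch that reproduces the sequence follows. The parameter names and values are assumptions fitted to this log (they resemble condor_master's MASTER_BACKOFF_* settings) and are not read from this pool's configuration.

```python
def restart_delay(failures: int, constant: int = 9,
                  factor: float = 2.0, ceiling: int = 3600) -> int:
    """Delay in seconds before the next restart attempt.

    failures is the number of restarts already attempted, starting at 0.
    The delay grows as constant + factor**failures and is capped at ceiling.
    """
    return int(min(constant + factor ** failures, ceiling))


if __name__ == "__main__":
    # The first eight delays match the "restarting ... in N seconds" lines.
    print([restart_delay(n) for n in range(8)])
    # [10, 11, 13, 17, 25, 41, 73, 137]
```

Under this model the delay keeps doubling until it reaches the ceiling, so a machine left in this state retries less and less often rather than recovering by itself.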
Thanks for any info/help.
Cheers
Greg
Logs for access violation problems (exit code -1073741819)
MasterLog
10/27 09:17:00 DaemonCore: pid 3588 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:17:00 The QUILL (pid 3588) died due to exception ACCESS_VIOLATION
10/27 09:17:00 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:17:03 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 10 seconds
10/27 09:17:03 DaemonCore: return from reaper for pid 3588
10/27 09:17:13 Started DaemonCore process "C:\PROGRA~1\condor/bin/condor_quill.exe", pid and pgroup = 4052
10/27 09:17:13 Received UDP command 60008 (DC_CHILDALIVE) from <130.116.67.243:9263>, access level DAEMON
10/27 09:17:13 Calling HandleReq <HandleChildAliveCommand> (0)
10/27 09:17:13 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.015s)
10/27 09:19:03 Received UDP command 60011 (DC_NOP) from <130.116.67.243:9836>, access level READ
10/27 09:19:03 Calling HandleReq <handle_nop()> (0)
10/27 09:19:03 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.000s)
10/27 09:19:03 DaemonCore: pid 4052 exited with status -1073741819, invoking reaper 1 <Daemons::DefaultReaper()>
10/27 09:19:03 The QUILL (pid 4052) died due to exception ACCESS_VIOLATION
10/27 09:19:03 Sending obituary for "C:\PROGRA~1\condor/bin/condor_quill.exe"
10/27 09:19:05 restarting C:\PROGRA~1\condor/bin/condor_quill.exe in 11 seconds
10/27 09:19:05 DaemonCore: return from reaper for pid 4052
QuillLog
10/27 09:15:11 ******************************************************
10/27 09:15:11 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:15:11 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:15:11 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:15:11 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:15:11 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:15:11 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:15:11 ** PID = 3588
10/27 09:15:11 ** Log last touched 10/27 09:01:21
10/27 09:15:11 ******************************************************
10/27 09:15:11 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:15:11 Using local config sources:
10/27 09:15:11 C:\PROGRA~1\condor/condor_config.local
10/27 09:15:11 DaemonCore: Command Socket at <130.116.67.243:9494>
10/27 09:15:11 main_init() called
10/27 09:15:11 configuring tt options from config file
10/27 09:15:11 Using Polling Period = 10
10/27 09:15:11 Using logs 10/27 09:15:11 C:\PROGRA~1\condor/log/sql.log 10/27 09:15:11
10/27 09:15:11 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:15:11 Using Database Type = Postgres
10/27 09:15:11 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:15:11 Using Database Name = quilldatabase
10/27 09:15:11 Using Database User = quillwriter
10/27 09:15:12 ******** Start of Polling Job Queue Log ********
10/27 09:15:12 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:15:12 ********* End of Polling Job Queue Log *********
10/27 09:15:12 ******** Start of Polling Event Log ********
10/27 09:17:13 ******************************************************
10/27 09:17:13 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:17:13 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:17:13 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:17:13 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:17:13 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:17:13 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:17:13 ** PID = 4052
10/27 09:17:13 ** Log last touched 10/27 09:15:12
10/27 09:17:13 ******************************************************
10/27 09:17:13 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:17:13 Using local config sources:
10/27 09:17:13 C:\PROGRA~1\condor/condor_config.local
10/27 09:17:13 DaemonCore: Command Socket at <130.116.67.243:9459>
10/27 09:17:13 main_init() called
10/27 09:17:13 configuring tt options from config file
10/27 09:17:13 Using Polling Period = 10
10/27 09:17:13 Using logs 10/27 09:17:13 C:\PROGRA~1\condor/log/sql.log 10/27 09:17:13
10/27 09:17:13 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:17:13 Using Database Type = Postgres
10/27 09:17:13 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:17:13 Using Database Name = quilldatabase
10/27 09:17:13 Using Database User = quillwriter
10/27 09:17:13 ******** Start of Polling Job Queue Log ********
10/27 09:17:13 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:17:14 ********* End of Polling Job Queue Log *********
10/27 09:17:14 ******** Start of Polling Event Log ********
10/27 09:19:17 ******************************************************
10/27 09:19:17 ** condor_quill.exe (CONDOR_QUILL) STARTING UP
10/27 09:19:17 ** C:\PROGRA~1\condor\bin\condor_quill.exe
10/27 09:19:17 ** SubsystemInfo: name=QUILL type=DAEMON(10) class=DAEMON(1)
10/27 09:19:17 ** Configuration: subsystem:QUILL local:<NONE> class:DAEMON
10/27 09:19:17 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
10/27 09:19:17 ** $CondorPlatform: INTEL-WINNT50 $
10/27 09:19:17 ** PID = 2900
10/27 09:19:17 ** Log last touched 10/27 09:17:14
10/27 09:19:17 ******************************************************
10/27 09:19:17 Using config source: c:\PROGRA~1\condor\condor_config
10/27 09:19:17 Using local config sources:
10/27 09:19:17 C:\PROGRA~1\condor/condor_config.local
10/27 09:19:17 DaemonCore: Command Socket at <130.116.67.243:9496>
10/27 09:19:17 main_init() called
10/27 09:19:17 configuring tt options from config file
10/27 09:19:17 Using Polling Period = 10
10/27 09:19:17 Using logs 10/27 09:19:17 C:\PROGRA~1\condor/log/sql.log 10/27 09:19:17
10/27 09:19:17 Using Job Queue File C:\PROGRA~1\condor/spool/job_queue.log
10/27 09:19:17 Using Database Type = Postgres
10/27 09:19:17 Using Database IpAddress = condorquill.csiro.au:5432
10/27 09:19:17 Using Database Name = quilldatabase
10/27 09:19:17 Using Database User = quillwriter
10/27 09:19:17 ******** Start of Polling Job Queue Log ********
10/27 09:19:17 JOB QUEUE POLLING RESULT: COMPRESSED
10/27 09:19:17 ********* End of Polling Job Queue Log *********
10/27 09:19:17 ******** Start of Polling Event Log ********
core.WIN32.QUILL
//=====================================================
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 00401895 01:00000895 C:\PROGRA~1\condor\bin\condor_quill.exe
Registers:
EAX:00000000
EBX:00C8BFFF
ECX:0012F6AC
EDX:00000000
ESI:00000000
EDI:0012F6D8
CS:EIP:001B:00401895
SS:ESP:0023:0012F5DC
EBP:0012F66C
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010256
Call stack:
Address Frame
00401895 0012F66C condor_ttdb_buildts (c:\condor\execute\dir_5692\userdir\src\condor_tt\condor_ttdb.cpp:64)
0040C620 0012F85C TTManager::insertScheddAd (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:1579)
0040F6F7 0012F910 TTManager::event_maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:599)
0040FDD5 0012F9B4 TTManager::maintain (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:250)
0041018A 0012F9C0 TTManager::pollingTime (c:\condor\execute\dir_5692\userdir\src\condor_tt\ttmanager.cpp:199)
004222AF 0012FA64 TimerManager::Timeout (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\timer_manager.cpp:493)
0041F38B 0012FEEC DaemonCore::Driver (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core.cpp:2622)
00414C62 0012FF60 dc_main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2106)
00414D62 0012FF78 main (c:\condor\execute\dir_5692\userdir\src\condor_daemon_core.v6\daemon_core_main.cpp:2169)
00487810 0012FFC0 __tmainCRTStartup (f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:266)
7C817077 0012FFF0 RegisterWaitForInputIdle+49
------------------------------------------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining
Australian Resources Research Centre (ARRC)
phone: +61 8 6436 8663
fax: +61 8 6436 8555
mob: 0407 952 748
Postal address: PO Box 1130, Bentley WA 6102, Australia
Street address: 26 Dick Perry Avenue, Kensington WA 6151
-------------------------------------------------------------------------------------------------------