Hi James:
I wonder if you could post the core file from the execute node’s
starter—it should have been emailed to your admin email after the crash.
-B
From:
condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On
Behalf Of Wojtek Goscinski
Sent: Sunday, July 29, 2007 8:19 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] ACCESS_VIOLATION under Windows
Hi All,
I'm experiencing a problem setting up a windows box as a condor execute node -
specifically to execute java jobs.
I have a windows box running xp sp2. It is purely set up as an execute node.
The start deamon picks successfully picks up the job and attempts to execute
it. It spawns the condor_starter - but the condor_starter seems to crash with
an exception (an ACCESS_VIOLATION).
As you can see in log below, the starter process seems to try to launch java,
but this ends in an exception? The starter crashes immediately after that last
log. I've confirmed that java exists at the location specified etc.
I assume this might be some sort of windows security issue, but I'm not sure
how to debug it. The condor vm user was given rights to execute the java
directory - though i'm not sure whether this is enough.
Any help or tips for debugging are most welcome.
-james
Start Log
-------------
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: allowing an empty message for
sock.
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: Success.
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: Command received via UDP from host
< 172.19.189.3:9629>
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: received command 60011 (DC_NOP),
calling handler (handle_nop())
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) Calling Handler
<HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:52 (fd:3) (pid:3636) KEYCACHEX: removing session
hp-test-02:3636:1185343491:6 for <172.19.189.3:9618 >
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: pid 3940 exited with status
-1073741819, invoking reaper 1 <reaper>
7/25 16:04:52 (fd:3) (pid:3636) Starter pid 3940 died on signal -1073741819
(exception ACCESS_VIOLATION)
7/25 16:04:52 (fd:3) (pid:3636) Entering ProcFamily::hardkill
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_c++_util\killfamily.C:274
7/25 16:04:52 (fd:3) (pid:3636) Destroying Daemon object:
7/25 16:04:52 (fd:3) (pid:3636) Type: 1 (any), Name: (null), Addr: < 172.19.189.3:9611>
7/25 16:04:52 (fd:3) (pid:3636) FullHost: (null), Host: (null), Pool: (null),
Port: -1
7/25 16:04:52 (fd:3) (pid:3636) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:52 (fd:3) (pid:3636) --- End of Daemon object info ---
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found (OpenProcess
err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found (OpenProcess
err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: parent: 3940 family:
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: alive_cpu_user = 0, exited_cpu = 0,
max_image = 3624k
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_c++_util\killfamily.C:475
7/25 16:04:52 (fd:3) (pid:3636) Attempting to remove C:\condor\execute\dir_3940
as SuperUser (system)
7/25 16:04:52 (fd:3) (pid:3636) Deleted ProcFamily w/ pid 3940 as parent
7/25 16:04:52 (fd:3) (pid:3636) State change: starter exited
7/25 16:04:52 (fd:3) (pid:3636) Changing activity: Busy -> Idle
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In cancel_timer(), id=66
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In DaemonCore Timeout()
7/25 16:04:52 (fd:3) (pid:3636)
Starter Log
----------------
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: allowing an empty message for
sock.
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: Success.
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: Command received via UDP from host
< 172.19.189.3:9614>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: received command 60011 (DC_NOP),
calling handler (handle_nop())
7/25 16:04:51 (fd:8) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:51 (fd:8) (pid:3940) Calling Handler
<HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: tid 3300 exited with status 1,
invoking reaper 2 <FileTransfer::Reaper()>
7/25 16:04:51 (fd:8) (pid:3940) File transfer completed successfully.
7/25 16:04:51 (fd:6) (pid:3940) Destroying Daemon object:
7/25 16:04:51 (fd:6) (pid:3940) Type: 1 (any), Name: (null), Addr: <172.19.189.3:9618>
7/25 16:04:51 (fd:6) (pid:3940) FullHost: (null), Host: (null), Pool: (null),
Port: -1
7/25 16:04:51 (fd:6) (pid:3940) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:51 (fd:6) (pid:3940) --- End of Daemon object info ---
7/25 16:04:52 (fd:6) (pid:3940) Calling client FileTransfer handler function.
7/25 16:04:52 (fd:6) (pid:3940) in DaemonCore NewTimer()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) leaving DaemonCore NewTimer, id=7
7/25 16:04:52 (fd:6) (pid:3940) Job 71.0 set to execute immediately
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) In DaemonCore Timeout()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore: Calling handler for Timer 7
(deferred job start)
7/25 16:04:52 (fd:6) (pid:3940) Starting a JAVA universe job with ID: 71.0
7/25 16:04:52 (fd:6) (pid:3940) In OsProc::OsProc()
7/25 16:04:52 (fd:6) (pid:3940) Main job KillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job RmKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job HoldKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) SYSAPI_GET_LOADAVG is undefined, using default
value of True
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Cmd="C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE"
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Args=-Xmx247m -classpath
C:\condor/lib;C:\condor/lib/scimark2lib.jar;. -Dchirp.config=C:\condor\execute\dir_3940\chirp.config
CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start
C:\condor\execute\dir_3940\jvm.end JavaTest
7/25 16:04:52 (fd:6) (pid:3940) in VanillaProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) in OsProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) IWD: C:\condor/execute\dir_3940
7/25 16:04:52 (fd:6) (pid:3940) get_port_range - (LOWPORT,HIGHPORT) is
(9600,9700).
7/25 16:04:52 (fd:6) (pid:3940) TokenCache contents:
condor-reuse-vm1@.
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_USER at
..\src\condor_starter.V6.1\os_proc.C:227
7/25 16:04:52 (fd:7) (pid:3940) Input file: NUL
7/25 16:04:52 (fd:8) (pid:3940) Output file:
C:\condor/execute\dir_3940\JavaTest.output.0
7/25 16:04:52 (fd:9) (pid:3940) Error file: C:\condor/execute\dir_3940\JavaTest.error.0
7/25 16:04:52 (fd:9) (pid:3940) Doing CONDOR_begin_execution
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10
7/25 16:04:52 (fd:9) (pid:3940) About to exec
C:\condor/execute\dir_3940\"C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE" -Xmx247m -classpath
C:\condor/lib;C:\condor/lib/scimark2lib.jar;. -
Dchirp.config=C:\condor\execute\dir_3940\chirp.config CondorJavaWrapper
C:\condor\execute\dir_3940\jvm.start C:\condor\execute\dir_3940\jvm.end
JavaTest
7/25 16:04:52 (fd:9) (pid:3940) Env =
_CONDOR_SCRATCH_DIR=C:\condor\execute\dir_3940 _CONDOR_HIGHPORT=9700
_CONDOR_LOWPORT=9600
7/25 16:04:52 (fd:9) (pid:3940) JOB_INHERITS_STARTER_ENVIRONMENT is undefined,
using default value of False
7/25 16:04:52 (fd:9) (pid:3940) PRIV_USER --> PRIV_CONDOR at
..\src\condor_starter.V6.1\os_proc.C:343
7/25 16:04:52 (fd:9) (pid:3940) In
DaemonCore::Create_Process(C:\condor/execute\dir_3940\"C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE",...)