James:
That's strange. You have set the configuration correctly, so you're not
missing anything; it sounds as if the core files simply haven't been created.
Could you turn your debugging level up (STARTER_DEBUG = D_ALL), re-run the
job, and repost the resulting logs in full?
-B
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Wojtek Goscinski
Sent: Wednesday, August 01, 2007 12:08 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] ACCESS_VIOLATION under Windows
Hi Ben,
SMTP is currently unavailable from that machine because of a firewall
issue, which I'm getting fixed.
I set CREATE_CORE_FILES = true, which I assume should give me a core
file in the log directory. However, I do not get a core file in
either the machine's log directory or the directory I submitted the
Java job from.
Am I missing something? Do I have to set something else for core files
to be dumped to the log directory, or is it possible that no core file
was created?
Regards,
James
On 7/31/07, Ben Burnett <burnett@xxxxxxxxxxx> wrote:
Hi James:
I wonder if you could post the core file from the execute node's
starter. It should have been emailed to your admin email address after
the crash.
-B
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Wojtek Goscinski
Sent: Sunday, July 29, 2007 8:19 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] ACCESS_VIOLATION under Windows
Hi All,
I'm experiencing a problem setting up a Windows box as a Condor execute
node, specifically to execute Java jobs.
I have a Windows box running XP SP2. It is set up purely as an execute
node. The start daemon successfully picks up the job and attempts to
execute it. It spawns the condor_starter, but the condor_starter seems to
crash with an exception (an ACCESS_VIOLATION).
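
For reference, the exit status -1073741819 that appears in the start log below is the signed 32-bit form of the Windows status code 0xC0000005 (access violation). The following is only a minimal sketch of that conversion; the function names and the one-entry lookup table are illustrative and not part of Condor.

```python
# Illustrative lookup; only this one code appears in the logs in this thread.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def to_ntstatus(exit_status: int) -> str:
    """Return the unsigned 32-bit hex form of a signed Windows exit status."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


def describe(exit_status: int) -> str:
    """Return the hex code plus a name, if the code is in KNOWN_STATUS."""
    code = exit_status & 0xFFFFFFFF
    return f"{to_ntstatus(exit_status)} ({KNOWN_STATUS.get(code, 'unknown')})"


if __name__ == "__main__":
    print(describe(-1073741819))
```

The hex form is usually the easier one to search for when looking up what a crashed process reported.
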
As you can see in the logs below, the starter process seems to try to
launch Java, but this ends in an exception. The starter crashes
immediately after the last log line shown. I've confirmed that Java exists
at the location specified, etc.
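
Because the last entry a daemon writes before it dies is the main clue in a case like this, it can help to pull that entry out of a long log mechanically. The sketch below assumes the entry prefix seen in the excerpts in this thread (date, time, `(fd:N)`, `(pid:N)`) and treats any line without that prefix as a wrapped continuation of the previous entry. The helper names are hypothetical.

```python
import re

# Assumed entry prefix, inferred from the log excerpts in this thread:
#   7/25 16:04:52 (fd:9) (pid:3940) message text
ENTRY = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \(fd:\d+\) \(pid:(\d+)\) ?(.*)$")


def entries(lines):
    """Yield (timestamp, pid, message), folding wrapped lines back in."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            if current:
                yield tuple(current)
            current = [match.group(1), int(match.group(2)), match.group(3)]
        elif current is not None:
            # No prefix: treat as a continuation of the previous entry.
            current[2] = (current[2] + " " + line.strip()).strip()
    if current:
        yield tuple(current)


def last_entry(lines, pid):
    """Return (timestamp, message) of the last non-empty entry for pid."""
    found = None
    for timestamp, entry_pid, message in entries(lines):
        if entry_pid == pid and message:
            found = (timestamp, message)
    return found
```

Called as `last_entry(open(path), 3940)` on a starter log (where `path` is wherever that log lives on the execute node), this would return the final thing the starter reported before it stopped writing.
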
I assume this might be some sort of Windows security issue, but I'm not
sure how to debug it. The Condor VM user was given execute rights on the
Java directory, though I'm not sure whether this is enough.
Any help or tips for debugging are most welcome.
-james
Start Log
-------------
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: allowing an empty message
for sock.
7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: Success.
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: Command received via UDP from
host < 172.19.189.3:9629>
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: received command 60011
(DC_NOP), calling handler (handle_nop())
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) Calling Handler
<HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:52 (fd:3) (pid:3636) KEYCACHEX: removing session
hp-test-02:3636:1185343491:6 for <172.19.189.3:9618 >
7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: pid 3940 exited with status
-1073741819, invoking reaper 1 <reaper>
7/25 16:04:52 (fd:3) (pid:3636) Starter pid 3940 died on signal -1073741819
(exception ACCESS_VIOLATION)
7/25 16:04:52 (fd:3) (pid:3636) Entering ProcFamily::hardkill
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_c++_util\killfamily.C:274
7/25 16:04:52 (fd:3) (pid:3636) Destroying Daemon object:
7/25 16:04:52 (fd:3) (pid:3636) Type: 1 (any), Name: (null), Addr: <
172.19.189.3:9611>
7/25 16:04:52 (fd:3) (pid:3636) FullHost: (null), Host: (null), Pool:
(null), Port: -1
7/25 16:04:52 (fd:3) (pid:3636) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:52 (fd:3) (pid:3636) --- End of Daemon object info ---
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found
(OpenProcess err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found
(OpenProcess err=1308)
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: parent: 3940 family:
7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: alive_cpu_user = 0,
exited_cpu = 0, max_image = 3624k
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_c++_util\killfamily.C:475
7/25 16:04:52 (fd:3) (pid:3636) Attempting to remove
C:\condor\execute\dir_3940 as SuperUser (system)
7/25 16:04:52 (fd:3) (pid:3636) Deleted ProcFamily w/ pid 3940 as parent
7/25 16:04:52 (fd:3) (pid:3636) State change: starter exited
7/25 16:04:52 (fd:3) (pid:3636) Changing activity: Busy -> Idle
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In cancel_timer(), id=66
7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:3) (pid:3636) In DaemonCore Timeout()
7/25 16:04:52 (fd:3) (pid:3636)
Starter Log
----------------
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: setting sock->decode()
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: allowing an empty message
for sock.
7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: Success.
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: Command received via UDP from
host < 172.19.189.3:9614>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: received command 60011
(DC_NOP), calling handler (handle_nop())
7/25 16:04:51 (fd:8) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:51 (fd:8) (pid:3940) Calling Handler
<HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: tid 3300 exited with status 1,
invoking reaper 2 <FileTransfer::Reaper()>
7/25 16:04:51 (fd:8) (pid:3940) File transfer completed successfully.
7/25 16:04:51 (fd:6) (pid:3940) Destroying Daemon object:
7/25 16:04:51 (fd:6) (pid:3940) Type: 1 (any), Name: (null), Addr:
<172.19.189.3:9618>
7/25 16:04:51 (fd:6) (pid:3940) FullHost: (null), Host: (null), Pool:
(null), Port: -1
7/25 16:04:51 (fd:6) (pid:3940) IsLocal: N, IdStr: (null), Error: (null)
7/25 16:04:51 (fd:6) (pid:3940) --- End of Daemon object info ---
7/25 16:04:52 (fd:6) (pid:3940) Calling client FileTransfer handler
function.
7/25 16:04:52 (fd:6) (pid:3940) in DaemonCore NewTimer()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) leaving DaemonCore NewTimer, id=7
7/25 16:04:52 (fd:6) (pid:3940) Job 71.0 set to execute immediately
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
..\src\condor_daemon_core.V6\daemon_core.C:2743
7/25 16:04:52 (fd:6) (pid:3940) In DaemonCore Timeout()
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
period = 0, handler_descrip=<deferred job start>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
period = 0, handler_descrip=<dc_touch_log_file>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
period = 240, handler_descrip=<self_monitor>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
period = 300, handler_descrip=<check_session_cache>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
period = 1170, handler_descrip=<DaemonCore::SendAliveToParent>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
period = 1801, handler_descrip=<handle_cookie_refresh>
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
period = 0, handler_descrip=<DaemonCore::ReInit()>
7/25 16:04:52 (fd:6) (pid:3940)
7/25 16:04:52 (fd:6) (pid:3940) DaemonCore: Calling handler for Timer 7
(deferred job start)
7/25 16:04:52 (fd:6) (pid:3940) Starting a JAVA universe job with ID: 71.0
7/25 16:04:52 (fd:6) (pid:3940) In OsProc::OsProc()
7/25 16:04:52 (fd:6) (pid:3940) Main job KillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job RmKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) Main job HoldKillSignal: 15 (Unknown)
7/25 16:04:52 (fd:6) (pid:3940) SYSAPI_GET_LOADAVG is undefined, using
default value of True
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Cmd="C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE"
7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Args=-Xmx247m -classpath
C:\condor/lib;C:\condor/lib/scimark2lib.jar;.
-Dchirp.config=C:\condor\execute\dir_3940\chirp.config
CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start
C:\condor\execute\dir_3940\jvm.end JavaTest
7/25 16:04:52 (fd:6) (pid:3940) in VanillaProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) in OsProc::StartJob()
7/25 16:04:52 (fd:6) (pid:3940) IWD: C:\condor/execute\dir_3940
7/25 16:04:52 (fd:6) (pid:3940) get_port_range - (LOWPORT,HIGHPORT) is
(9600,9700).
7/25 16:04:52 (fd:6) (pid:3940) TokenCache contents:
condor-reuse-vm1@.
7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_USER at
..\src\condor_starter.V6.1\os_proc.C:227
7/25 16:04:52 (fd:7) (pid:3940) Input file: NUL
7/25 16:04:52 (fd:8) (pid:3940) Output file:
C:\condor/execute\dir_3940\JavaTest.output.0
7/25 16:04:52 (fd:9) (pid:3940) Error file:
C:\condor/execute\dir_3940\JavaTest.error.0
7/25 16:04:52 (fd:9) (pid:3940) Doing CONDOR_begin_execution
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10
7/25 16:04:52 (fd:9) (pid:3940) About to exec
C:\condor/execute\dir_3940\"C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE" -Xmx247m
-classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;.
-Dchirp.config=C:\condor\execute\dir_3940\chirp.config
CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start
C:\condor\execute\dir_3940\jvm.end JavaTest
7/25 16:04:52 (fd:9) (pid:3940) Env =
_CONDOR_SCRATCH_DIR=C:\condor\execute\dir_3940
_CONDOR_HIGHPORT=9700 _CONDOR_LOWPORT=9600
7/25 16:04:52 (fd:9) (pid:3940) JOB_INHERITS_STARTER_ENVIRONMENT is
undefined, using default value of False
7/25 16:04:52 (fd:9) (pid:3940) PRIV_USER --> PRIV_CONDOR at
..\src\condor_starter.V6.1\os_proc.C:343
7/25 16:04:52 (fd:9) (pid:3940) In
DaemonCore::Create_Process(C:\condor/execute\dir_3940\"C:\\Program
Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE",...)
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to
condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/