
Re: [Condor-users] ACCESS_VIOLATION under Windows



Ben,

Please find attached the logs and configs for the box in question.
Regards,

james


On 8/2/07, Ben Burnett <burnett@xxxxxxxxxxx> wrote:
> James:
>
> That's strange; however, you have set the configuration correctly, so it's
> nothing you're missing--it sounds as if the core files haven't been created.
> Could you try turning your debugging level up (STARTER_DEBUG = D_ALL), re-run
> the job, and repost the resulting logs in full?
>
> -B
>
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Wojtek Goscinski
> Sent: Wednesday, August 01, 2007 12:08 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] ACCESS_VIOLATION under Windows
>
> Hi Ben,
>
> SMTP is currently unavailable from that machine - a firewall issue
> which I'm getting fixed.
> I set CREATE_CORE_FILES = true - which I assume should give me a core
> file in the log directory? However, I do not receive a core file in
> either the machine's log directory or the directory I submitted the
> java job from.
>
> Am I missing something? Do I have to set something else for core files
> to be dumped to log, or is it possible that a core file is not
> created?
>
> Regards,
>
> James
>
>
> On 7/31/07, Ben Burnett <burnett@xxxxxxxxxxx> wrote:
> >
> >
> >
> >
> > Hi James:
> >
> >
> >
> > I wonder if you could post the core file from the execute node's
> > starter; it should have been emailed to your admin email after the crash.
> >
> >
> >
> > -B
> >
> >
> >
> >
> > From: condor-users-bounces@xxxxxxxxxxx
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
> > Wojtek Goscinski
> >  Sent: Sunday, July 29, 2007 8:19 PM
> >  To: condor-users@xxxxxxxxxxx
> >  Subject: [Condor-users] ACCESS_VIOLATION under Windows
> >
> >
> >
> >
> > Hi All,
> >
> >  I'm experiencing a problem setting up a Windows box as a Condor execute
> > node - specifically to execute Java jobs.
> >
> >  I have a Windows box running XP SP2. It is purely set up as an execute
> > node. The start daemon successfully picks up the job and attempts to
> > execute it. It spawns the condor_starter - but the condor_starter seems to
> > crash with an exception (an ACCESS_VIOLATION).
> >
> >  As you can see in the log below, the starter process seems to try to launch
> > java, but this ends in an exception. The starter crashes immediately after
> > that last log line. I've confirmed that java exists at the location specified
> > etc.
> >
> >  I assume this might be some sort of Windows security issue, but I'm not
> > sure how to debug it. The condor vm user was given rights to execute the
> > java directory - though I'm not sure whether this is enough.
> >
> >  Any help or tips for debugging are most welcome.
> >
> >  -james
> >
> >
> >  Start Log
> >  -------------
> >
> >  7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: setting sock->decode()
> >  7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: allowing an empty
> > message for sock.
> >  7/25 16:04:52 (fd:3) (pid:3636) DC_AUTHENTICATE: Success.
> >  7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: Command received via UDP from
> > host <172.19.189.3:9629>
> >  7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: received command 60011
> > (DC_NOP), calling handler (handle_nop())
> >  7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:52 (fd:3) (pid:3636) Calling Handler
> > <HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
> >  7/25 16:04:52 (fd:3) (pid:3636) KEYCACHEX: removing session
> > hp-test-02:3636:1185343491:6 for <172.19.189.3:9618>
> >  7/25 16:04:52 (fd:3) (pid:3636) DaemonCore: pid 3940 exited with status
> > -1073741819, invoking reaper 1 <reaper>
> >  7/25 16:04:52 (fd:3) (pid:3636) Starter pid 3940 died on signal
> > -1073741819 (exception ACCESS_VIOLATION)
> >  7/25 16:04:52 (fd:3) (pid:3636) Entering ProcFamily::hardkill
> >  7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_c++_util\killfamily.C:274
> >  7/25 16:04:52 (fd:3) (pid:3636) Destroying Daemon object:
> >  7/25 16:04:52 (fd:3) (pid:3636) Type: 1 (any), Name: (null), Addr:
> > <172.19.189.3:9611>
> >  7/25 16:04:52 (fd:3) (pid:3636) FullHost: (null), Host: (null), Pool:
> > (null), Port: -1
> >  7/25 16:04:52 (fd:3) (pid:3636) IsLocal: N, IdStr: (null), Error: (null)
> >  7/25 16:04:52 (fd:3) (pid:3636)  --- End of Daemon object info ---
> >  7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found
> > (OpenProcess err=1308)
> >  7/25 16:04:52 (fd:3) (pid:3636) ProcAPI: pid # 3940 was not found
> > (OpenProcess err=1308)
> >  7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: parent: 3940 family:
> >  7/25 16:04:52 (fd:3) (pid:3636) ProcFamily: alive_cpu_user = 0,
> > exited_cpu = 0, max_image = 3624k
> >  7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_c++_util\killfamily.C:475
> >  7/25 16:04:52 (fd:3) (pid:3636) Attempting to remove
> > C:\condor\execute\dir_3940 as SuperUser (system)
> >  7/25 16:04:52 (fd:3) (pid:3636) Deleted ProcFamily w/ pid 3940 as parent
> >  7/25 16:04:52 (fd:3) (pid:3636) State change: starter exited
> >  7/25 16:04:52 (fd:3) (pid:3636) Changing activity: Busy -> Idle
> >  7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:52 (fd:3) (pid:3636) In cancel_timer(), id=66
> >  7/25 16:04:52 (fd:3) (pid:3636) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:52 (fd:3) (pid:3636) In DaemonCore Timeout()
> >  7/25 16:04:52 (fd:3) (pid:3636)
> >
> >  Starter Log
> >  ----------------
> >  7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: setting sock->decode()
> >  7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: allowing an empty
> > message for sock.
> >  7/25 16:04:51 (fd:8) (pid:3940) DC_AUTHENTICATE: Success.
> >  7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: Command received via UDP from
> > host <172.19.189.3:9614>
> >  7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: received command 60011
> > (DC_NOP), calling handler (handle_nop())
> >  7/25 16:04:51 (fd:8) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:51 (fd:8) (pid:3940) Calling Handler
> > <HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
> >  7/25 16:04:51 (fd:8) (pid:3940) DaemonCore: tid 3300 exited with status
> > 1, invoking reaper 2 <FileTransfer::Reaper()>
> >  7/25 16:04:51 (fd:8) (pid:3940) File transfer completed successfully.
> >  7/25 16:04:51 (fd:6) (pid:3940) Destroying Daemon object:
> >  7/25 16:04:51 (fd:6) (pid:3940) Type: 1 (any), Name: (null), Addr:
> > <172.19.189.3:9618>
> >  7/25 16:04:51 (fd:6) (pid:3940) FullHost: (null), Host: (null), Pool:
> > (null), Port: -1
> >  7/25 16:04:51 (fd:6) (pid:3940) IsLocal: N, IdStr: (null), Error: (null)
> >  7/25 16:04:51 (fd:6) (pid:3940)  --- End of Daemon object info ---
> >  7/25 16:04:52 (fd:6) (pid:3940) Calling client FileTransfer handler
> > function.
> >  7/25 16:04:52 (fd:6) (pid:3940) in DaemonCore NewTimer()
> >  7/25 16:04:52 (fd:6) (pid:3940)
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
> > period = 0, handler_descrip=<deferred job start>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
> > period = 0, handler_descrip=<dc_touch_log_file>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
> > period = 240, handler_descrip=<self_monitor>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
> > period = 300, handler_descrip=<check_session_cache>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
> > period = 1170,
> > handler_descrip=<DaemonCore::SendAliveToParent>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
> > period = 1801, handler_descrip=<handle_cookie_refresh>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
> > period = 0, handler_descrip=<DaemonCore::ReInit()>
> >  7/25 16:04:52 (fd:6) (pid:3940)
> >  7/25 16:04:52 (fd:6) (pid:3940) leaving DaemonCore NewTimer, id=7
> >  7/25 16:04:52 (fd:6) (pid:3940) Job 71.0 set to execute immediately
> >  7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_CONDOR at
> > ..\src\condor_daemon_core.V6\daemon_core.C:2743
> >  7/25 16:04:52 (fd:6) (pid:3940) In DaemonCore Timeout()
> >  7/25 16:04:52 (fd:6) (pid:3940)
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> Timers
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> ~~~~~~
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 7, when = 1185343492,
> > period = 0, handler_descrip=<deferred job start>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 6, when = 1185343551,
> > period = 0, handler_descrip=<dc_touch_log_file>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 3, when = 1185343731,
> > period = 240, handler_descrip=<self_monitor>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 1, when = 1185343791,
> > period = 300, handler_descrip=<check_session_cache>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 5, when = 1185344661,
> > period = 1170,
> > handler_descrip=<DaemonCore::SendAliveToParent>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 2, when = 1185345292,
> > period = 1801, handler_descrip=<handle_cookie_refresh>
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore--> id = 4, when = 1185372290,
> > period = 0, handler_descrip=<DaemonCore::ReInit()>
> >  7/25 16:04:52 (fd:6) (pid:3940)
> >  7/25 16:04:52 (fd:6) (pid:3940) DaemonCore: Calling handler for Timer 7
> > (deferred job start)
> >  7/25 16:04:52 (fd:6) (pid:3940) Starting a JAVA universe job with ID:
> > 71.0
> >  7/25 16:04:52 (fd:6) (pid:3940) In OsProc::OsProc()
> >  7/25 16:04:52 (fd:6) (pid:3940) Main job KillSignal: 15 (Unknown)
> >  7/25 16:04:52 (fd:6) (pid:3940) Main job RmKillSignal: 15 (Unknown)
> >  7/25 16:04:52 (fd:6) (pid:3940) Main job HoldKillSignal: 15 (Unknown)
> >  7/25 16:04:52 (fd:6) (pid:3940) SYSAPI_GET_LOADAVG is undefined, using
> > default value of True
> >  7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Cmd="C:\\Program
> > Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE"
> >  7/25 16:04:52 (fd:6) (pid:3940) JavaProc: Args=-Xmx247m -classpath
> > C:\condor/lib;C:\condor/lib/scimark2lib.jar;.
> > -Dchirp.config=C:\condor\execute\dir_3940\chirp.config
> > CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start
> > C:\condor\execute\dir_3940\jvm.end JavaTest
> >  7/25 16:04:52 (fd:6) (pid:3940) in VanillaProc::StartJob()
> >  7/25 16:04:52 (fd:6) (pid:3940) in OsProc::StartJob()
> >  7/25 16:04:52 (fd:6) (pid:3940) IWD: C:\condor/execute\dir_3940
> >  7/25 16:04:52 (fd:6) (pid:3940) get_port_range - (LOWPORT,HIGHPORT) is
> > (9600,9700).
> >  7/25 16:04:52 (fd:6) (pid:3940) TokenCache contents:
> >  condor-reuse-vm1@.
> >  7/25 16:04:52 (fd:6) (pid:3940) PRIV_CONDOR --> PRIV_USER at
> > ..\src\condor_starter.V6.1\os_proc.C:227
> >  7/25 16:04:52 (fd:7) (pid:3940) Input file: NUL
> >  7/25 16:04:52 (fd:8) (pid:3940) Output file:
> > C:\condor/execute\dir_3940\JavaTest.output.0
> >  7/25 16:04:52 (fd:9) (pid:3940) Error file:
> > C:\condor/execute\dir_3940\JavaTest.error.0
> >  7/25 16:04:52 (fd:9) (pid:3940) Doing CONDOR_begin_execution
> >  7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
> >  7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
> >  7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfds=0
> >  7/25 16:04:52 (fd:9) (pid:3940) condor_read(): nfound=1
> >  7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10
> >  7/25 16:04:52 (fd:9) (pid:3940) About to exec
> > C:\condor/execute\dir_3940\"C:\\Program
> > Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE" -Xmx247m
> > -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;.
> > -Dchirp.config=C:\condor\execute\dir_3940\chirp.config
> > CondorJavaWrapper C:\condor\execute\dir_3940\jvm.start
> > C:\condor\execute\dir_3940\jvm.end JavaTest
> >  7/25 16:04:52 (fd:9) (pid:3940) Env =
> > _CONDOR_SCRATCH_DIR=C:\condor\execute\dir_3940
> > _CONDOR_HIGHPORT=9700 _CONDOR_LOWPORT=9600
> >  7/25 16:04:52 (fd:9) (pid:3940)
> > JOB_INHERITS_STARTER_ENVIRONMENT is undefined, using
> > default value of False
> >  7/25 16:04:52 (fd:9) (pid:3940) PRIV_USER --> PRIV_CONDOR at
> > ..\src\condor_starter.V6.1\os_proc.C:343
> >  7/25 16:04:52 (fd:9) (pid:3940) In
> > DaemonCore::Create_Process(C:\condor/execute\dir_3940\"C:\\Program
> > Files\\Java\\jre1.5.0_06\\bin\\JAVA.EXE",...)
> >
> >
> >
> >
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to
> > condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >
> >
>
>
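When comparing the Start Log and Starter Log quoted above, it can help to pull out the last line each process wrote before the crash. The following is only a rough sketch in Python, assuming the line prefix seen in this thread (date, time, "(fd:N)", "(pid:N)"); the names LogEntry, parse_line and last_entry_for_pid are made up for illustration and are not part of Condor. Other debug settings may produce a different prefix.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches the prefix seen in the quoted logs, e.g.
#   7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10
# This layout is an assumption taken from this thread only.
LINE_RE = re.compile(
    r"^\s*(\d+/\d+ \d+:\d+:\d+) \(fd:(\d+)\) \(pid:(\d+)\) ?(.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    fd: int
    pid: int
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None for continuation or unrelated lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts, fd, pid, msg = m.groups()
    return LogEntry(ts, int(fd), int(pid), msg)


def last_entry_for_pid(lines: Iterable[str], pid: int) -> Optional[LogEntry]:
    """Return the last parsed entry written by the given pid, if any."""
    last = None
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.pid == pid:
            last = entry
    return last


if __name__ == "__main__":
    # Lines copied from the logs in this thread.
    sample = [
        '7/25 16:04:52 (fd:9) (pid:3940) Renice expr "10" evaluated to 10',
        "7/25 16:04:52 (fd:3) (pid:3636) State change: starter exited",
        "7/25 16:04:52 (fd:3) (pid:3636) Changing activity: Busy -> Idle",
    ]
    print(last_entry_for_pid(sample, 3940))
    print(last_entry_for_pid(sample, 3636))
```

Run against the full starter log, the last entry for pid 3940 would show how far the starter got before it died.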

Attachment: Logs.zip
Description: Zip archive
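
A note on the number in the Start Log: the exit status -1073741819 is the signed 32-bit view of the Windows status code 0xC0000005, which is why the startd labels it ACCESS_VIOLATION. The short Python sketch below shows the conversion. The helper names (to_ntstatus, describe) and the small lookup table are illustrative only; the table lists just a few well-known codes and is not exhaustive.

```python
def to_ntstatus(exit_status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


# A few well-known Windows status codes (illustrative, not a complete list).
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}


def describe(exit_status: int) -> str:
    """Format an exit status as hex, with a name when the code is in the table."""
    code = to_ntstatus(exit_status)
    name = KNOWN_CODES.get(code, "unknown")
    return f"0x{code:08X} ({name})"


if __name__ == "__main__":
    # Value taken from the "Starter pid 3940 died on signal" line above.
    print(describe(-1073741819))
```

This only explains what the number means; it says nothing about why the starter touched invalid memory, which is what the D_ALL logs are meant to narrow down.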