Re: [Condor-users] Startd segment violation


Date: Wed, 2 Feb 2005 12:57:03 -0600 (CST)
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [Condor-users] Startd segment violation
On Wed, 2 Feb 2005, Mark Calleja wrote:

> Jaime Frey wrote:
>
> > On Wed, 2 Feb 2005, Mark Calleja wrote:
> >
> >> unusual, but the StartLog reports:
> >>
> >> 2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11).
> >>
> >> That's a segment violation there.  My question is, is that Condor's way
> >> of telling me that the user's application is segmenting, or the Start
> >> daemon itself? We see this behaviour on a number of linux boxes, all
> >> running dynamically linked versions of Condor 6.6.8 (seen it with 6.6.7
> >> too), for glibc 2.2 and 2.3.
> >>
> >
> > That's the condor starter daemon seg-faulting. It has its own log file
> > (StarterLog) which should hopefully have more clues as to what's going
> > wrong.
> >
>
> That's just it: the StarterLog looks happy enough. Nothing is recorded
> until the job is restarted and rescheduled on the same machine, whence
> you get the usual startup stuff, e.g. (note the timestamps; all machines
> are synced to the same ntpd server):
>
> 2/2 16:38:36 ******************************************************
> 2/2 16:38:36 ** condor_starter (CONDOR_STARTER) STARTING UP
> 2/2 16:38:36 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
> 2/2 16:38:36 ** $CondorVersion: 6.6.8 Jan 27 2005 $
> 2/2 16:38:36 ** $CondorPlatform: I386-LINUX_RH72 $
> 2/2 16:38:36 ** PID = 19086
> 2/2 16:38:36 ******************************************************
> 2/2 16:38:36 Using config file: /Condor/RH7/condor_config
> 2/2 16:38:36 Using local config files: /home/condor/condor_config.local
> 2/2 16:38:36 DaemonCore: Command Socket at <172.24.116.1:9696>
> 2/2 16:38:36 Done setting resource limits
> 2/2 16:38:36 Starter communicating with condor_shadow <172.24.116.10:9710>
> 2/2 16:38:36 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
> 2/2 16:38:36 VM1_USER set, so running job as condor_user
> 2/2 16:38:37 File transfer completed successfully.
> 2/2 16:38:38 Starting a VANILLA universe job with ID: 2605.0
> 2/2 16:38:38 IWD: /home/condor/execute/dir_19086
> 2/2 16:38:38 Output file: /home/condor/execute/dir_19086/_condor_stdout_2605.0
> 2/2 16:38:38 Error file: /home/condor/execute/dir_19086/_condor_stderr_2605.0
> 2/2 16:38:38 Renice expr "19" evaluated to 19
> 2/2 16:38:38 About to exec /home/condor/execute/dir_19086/condor_exec.exe
> 2/2 16:38:38 Create_Process succeeded, pid=19088
> 2/2 16:56:12 ******************************************************
> 2/2 16:56:12 ** condor_starter (CONDOR_STARTER) STARTING UP
> 2/2 16:56:12 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
> 2/2 16:56:12 ** $CondorVersion: 6.6.8 Jan 27 2005 $
> 2/2 16:56:12 ** $CondorPlatform: I386-LINUX_RH72 $
> 2/2 16:56:12 ** PID = 19292
> 2/2 16:56:12 ******************************************************
> 2/2 16:56:12 Using config file: /Condor/RH7/condor_config
> 2/2 16:56:12 Using local config files: /home/condor/condor_config.local
> 2/2 16:56:12 DaemonCore: Command Socket at <172.24.116.1:9652>
> 2/2 16:56:12 Done setting resource limits
> 2/2 16:56:12 Starter communicating with condor_shadow <172.24.116.10:9701>
> 2/2 16:56:12 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
> 2/2 16:56:12 VM1_USER set, so running job as condor_user
> 2/2 16:56:13 File transfer completed successfully.
> 2/2 16:56:14 Starting a VANILLA universe job with ID: 2605.0
> 2/2 16:56:14 IWD: /home/condor/execute/dir_19292
> 2/2 16:56:14 Output file: /home/condor/execute/dir_19292/_condor_stdout_2605.0
> 2/2 16:56:14 Error file: /home/condor/execute/dir_19292/_condor_stderr_2605.0
> 2/2 16:56:15 Renice expr "19" evaluated to 19
> 2/2 16:56:15 About to exec /home/condor/execute/dir_19292/condor_exec.exe
> 2/2 16:56:15 Create_Process succeeded, pid=19294
>
> The application runs for quite a while before the problems happen. Am I
> missing something obvious?

Are there any core files in the condor log directory on the execute
machine?
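
For anyone wanting to script that check, here is a rough sketch in Python (not part of Condor). It pulls the "died on signal" lines out of StartLog text, translates the signal number to a name, and lists anything that looks like a core file in a given directory. The function names are made up for illustration, and the assumption that core files are named "core" or "core.<something>" depends on the kernel settings of the execute machine. The log directory itself is site-specific; running `condor_config_val LOG` on the execute machine should print it.

```python
import re
import signal
from pathlib import Path

# Matches StartLog lines such as:
#   2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11).
DIED_RE = re.compile(r"Starter pid (\d+) died on signal (\d+)")


def starter_crashes(startlog_text):
    """Return a (pid, signal_number, signal_name) tuple per starter death."""
    crashes = []
    for line in startlog_text.splitlines():
        match = DIED_RE.search(line)
        if match is None:
            continue
        pid, signum = int(match.group(1)), int(match.group(2))
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on the machine running this script.
            name = "unknown"
        crashes.append((pid, signum, name))
    return crashes


def find_core_files(log_dir):
    """List files directly under log_dir named 'core' or 'core.<something>'.

    The naming pattern is an assumption; adjust it if the execute machine
    is configured to name core dumps differently.
    """
    return sorted(
        p for p in Path(log_dir).iterdir()
        if p.is_file() and (p.name == "core" or p.name.startswith("core."))
    )
```

Feeding the StartLog contents to starter_crashes() gives the PIDs to look for, and find_core_files() on the log directory shows whether a core was left behind for any of them.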

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+
