Re: [Condor-users] Startd segment violation


Date: Wed, 02 Feb 2005 17:42:58 +0000
From: Mark Calleja <mcal00@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] Startd segment violation
Jaime Frey wrote:

On Wed, 2 Feb 2005, Mark Calleja wrote:



unusual, but the StartLog reports:

2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11).

That's a segmentation violation there. My question is: is that Condor's
way of telling me that the user's application is segfaulting, or the Start
daemon itself? We see this behaviour on a number of Linux boxes, all
running dynamically linked versions of Condor 6.6.8 (seen it with 6.6.7
too), for glibc 2.2 and 2.3.


That's the condor starter daemon seg-faulting. It has its own log file
(StarterLog) which should hopefully have more clues as to what's going
wrong.
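For anyone scanning a StartLog for the same symptom, here is a minimal sketch in Python that pulls the starter PID and signal out of a line of that form and names the signal. The regular expression is inferred only from the single StartLog line quoted above, and the helper name decode_starter_death is made up for illustration; neither is part of Condor.

```python
import re
import signal

# Pattern inferred from the one StartLog line quoted in this thread (an assumption,
# not a documented Condor log format).
DIED_RE = re.compile(r"Starter pid (\d+) died on signal (\d+)")

def decode_starter_death(line):
    """Return (pid, signal_name) for a 'Starter pid N died on signal S' line, else None."""
    m = DIED_RE.search(line)
    if m is None:
        return None
    pid, signum = int(m.group(1)), int(m.group(2))
    try:
        name = signal.Signals(signum).name  # e.g. 11 -> SIGSEGV on Linux
    except ValueError:
        name = "signal %d" % signum         # number unknown on this platform
    return pid, name

if __name__ == "__main__":
    print(decode_starter_death(
        "2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11)."))
```

The PID it returns is the one to look for in the "** PID = ..." banner of the StarterLog.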




That's just it: the StarterLog looks happy enough. Nothing is recorded until the job is restarted and rescheduled on the same machine, whereupon you get the usual startup stuff, e.g. (note the timestamps; all machines are synced to the same ntpd server):


2/2 16:38:36 ******************************************************
2/2 16:38:36 ** condor_starter (CONDOR_STARTER) STARTING UP
2/2 16:38:36 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
2/2 16:38:36 ** $CondorVersion: 6.6.8 Jan 27 2005 $
2/2 16:38:36 ** $CondorPlatform: I386-LINUX_RH72 $
2/2 16:38:36 ** PID = 19086
2/2 16:38:36 ******************************************************
2/2 16:38:36 Using config file: /Condor/RH7/condor_config
2/2 16:38:36 Using local config files: /home/condor/condor_config.local
2/2 16:38:36 DaemonCore: Command Socket at <172.24.116.1:9696>
2/2 16:38:36 Done setting resource limits
2/2 16:38:36 Starter communicating with condor_shadow <172.24.116.10:9710>
2/2 16:38:36 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
2/2 16:38:36 VM1_USER set, so running job as condor_user
2/2 16:38:37 File transfer completed successfully.
2/2 16:38:38 Starting a VANILLA universe job with ID: 2605.0
2/2 16:38:38 IWD: /home/condor/execute/dir_19086
2/2 16:38:38 Output file: /home/condor/execute/dir_19086/_condor_stdout_2605.0
2/2 16:38:38 Error file: /home/condor/execute/dir_19086/_condor_stderr_2605.0
2/2 16:38:38 Renice expr "19" evaluated to 19
2/2 16:38:38 About to exec /home/condor/execute/dir_19086/condor_exec.exe
2/2 16:38:38 Create_Process succeeded, pid=19088
2/2 16:56:12 ******************************************************
2/2 16:56:12 ** condor_starter (CONDOR_STARTER) STARTING UP
2/2 16:56:12 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
2/2 16:56:12 ** $CondorVersion: 6.6.8 Jan 27 2005 $
2/2 16:56:12 ** $CondorPlatform: I386-LINUX_RH72 $
2/2 16:56:12 ** PID = 19292
2/2 16:56:12 ******************************************************
2/2 16:56:12 Using config file: /Condor/RH7/condor_config
2/2 16:56:12 Using local config files: /home/condor/condor_config.local
2/2 16:56:12 DaemonCore: Command Socket at <172.24.116.1:9652>
2/2 16:56:12 Done setting resource limits
2/2 16:56:12 Starter communicating with condor_shadow <172.24.116.10:9701>
2/2 16:56:12 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
2/2 16:56:12 VM1_USER set, so running job as condor_user
2/2 16:56:13 File transfer completed successfully.
2/2 16:56:14 Starting a VANILLA universe job with ID: 2605.0
2/2 16:56:14 IWD: /home/condor/execute/dir_19292
2/2 16:56:14 Output file: /home/condor/execute/dir_19292/_condor_stdout_2605.0
2/2 16:56:14 Error file: /home/condor/execute/dir_19292/_condor_stderr_2605.0
2/2 16:56:15 Renice expr "19" evaluated to 19
2/2 16:56:15 About to exec /home/condor/execute/dir_19292/condor_exec.exe
2/2 16:56:15 Create_Process succeeded, pid=19294


The application runs for quite a while before the problems happen. Am I missing something obvious?
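
To put a number on "quite a while", the following is a rough Python sketch that matches the PID from the StartLog death line against the StarterLog banners and reports how long that starter had been silent before it died. The line patterns are inferred purely from the excerpts in this message, the function names are hypothetical, and the year has to be supplied because the log timestamps do not carry one.

```python
import re
from datetime import datetime

# All patterns are inferred from the log excerpts in this thread (assumptions).
TS_RE = re.compile(r"^(\d+)/(\d+) (\d+):(\d+):(\d+) (.*)$")
PID_RE = re.compile(r"\*\* PID = (\d+)")
DIED_RE = re.compile(r"Starter pid (\d+) died on signal (\d+)")

def parse_line(line, year=2005):
    """Split 'M/D HH:MM:SS text' into (datetime, text); None if it does not match."""
    m = TS_RE.match(line.strip())
    if m is None:
        return None
    month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
    return datetime(year, month, day, hh, mm, ss), m.group(6)

def last_entry_per_starter(starter_log_lines, year=2005):
    """Map each starter PID to the timestamp of its last StarterLog entry."""
    last_seen = {}
    pid = None
    for line in starter_log_lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, text = parsed
        if text.strip("*").strip() == "":
            continue              # banner rule; may belong to the next run
        if "STARTING UP" in text:
            pid = None            # new run; its PID appears a few lines later
        m = PID_RE.search(text)
        if m:
            pid = int(m.group(1))
        if pid is not None:
            last_seen[pid] = when
    return last_seen

def silent_gap(startd_line, starter_log_lines, year=2005):
    """Return (pid, timedelta between the starter's last entry and its death)."""
    parsed = parse_line(startd_line, year)
    if parsed is None:
        return None
    died_at, text = parsed
    m = DIED_RE.search(text)
    if m is None:
        return None
    pid = int(m.group(1))
    last = last_entry_per_starter(starter_log_lines, year).get(pid)
    if last is None:
        return None
    return pid, died_at - last
```

Fed the StartLog line and the StarterLog excerpt above, this gives a gap of 17 minutes 32 seconds for PID 19086: the last entry is "Create_Process succeeded" at 16:38:38 and the starter dies at 16:56:10 without logging anything in between.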

Cheers,
Mark

