Jaime Frey wrote:
On Wed, 2 Feb 2005, Mark Calleja wrote:
unusual, but the StartLog reports:
2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11).
That's a segmentation violation. My question is: is that Condor's way
of telling me that the user's application is segfaulting, or the Start
daemon itself? We see this behaviour on a number of Linux boxes, all
running dynamically linked versions of Condor 6.6.8 (we've seen it with
6.6.7 too), built for glibc 2.2 and 2.3.
That's the condor starter daemon seg-faulting. It has its own log file
(StarterLog) which should hopefully have more clues as to what's going
wrong.
That's just it: the StarterLog looks happy enough. Nothing is recorded
at the time of the crash; the next entries appear only when the job is
restarted and rescheduled on the same machine, at which point you get
the usual startup stuff, e.g. (note the timestamps; all machines are
synced to the same ntpd server):
2/2 16:38:36 ******************************************************
2/2 16:38:36 ** condor_starter (CONDOR_STARTER) STARTING UP
2/2 16:38:36 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
2/2 16:38:36 ** $CondorVersion: 6.6.8 Jan 27 2005 $
2/2 16:38:36 ** $CondorPlatform: I386-LINUX_RH72 $
2/2 16:38:36 ** PID = 19086
2/2 16:38:36 ******************************************************
2/2 16:38:36 Using config file: /Condor/RH7/condor_config
2/2 16:38:36 Using local config files: /home/condor/condor_config.local
2/2 16:38:36 DaemonCore: Command Socket at <172.24.116.1:9696>
2/2 16:38:36 Done setting resource limits
2/2 16:38:36 Starter communicating with condor_shadow <172.24.116.10:9710>
2/2 16:38:36 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
2/2 16:38:36 VM1_USER set, so running job as condor_user
2/2 16:38:37 File transfer completed successfully.
2/2 16:38:38 Starting a VANILLA universe job with ID: 2605.0
2/2 16:38:38 IWD: /home/condor/execute/dir_19086
2/2 16:38:38 Output file: /home/condor/execute/dir_19086/_condor_stdout_2605.0
2/2 16:38:38 Error file: /home/condor/execute/dir_19086/_condor_stderr_2605.0
2/2 16:38:38 Renice expr "19" evaluated to 19
2/2 16:38:38 About to exec /home/condor/execute/dir_19086/condor_exec.exe
2/2 16:38:38 Create_Process succeeded, pid=19088
2/2 16:56:12 ******************************************************
2/2 16:56:12 ** condor_starter (CONDOR_STARTER) STARTING UP
2/2 16:56:12 ** /usr/Condor/RH7/condor-6.6.8-glibc22/sbin/condor_starter
2/2 16:56:12 ** $CondorVersion: 6.6.8 Jan 27 2005 $
2/2 16:56:12 ** $CondorPlatform: I386-LINUX_RH72 $
2/2 16:56:12 ** PID = 19292
2/2 16:56:12 ******************************************************
2/2 16:56:12 Using config file: /Condor/RH7/condor_config
2/2 16:56:12 Using local config files: /home/condor/condor_config.local
2/2 16:56:12 DaemonCore: Command Socket at <172.24.116.1:9652>
2/2 16:56:12 Done setting resource limits
2/2 16:56:12 Starter communicating with condor_shadow <172.24.116.10:9701>
2/2 16:56:12 Submitting machine is "tiger02--escience.grid.private.cam.ac.uk"
2/2 16:56:12 VM1_USER set, so running job as condor_user
2/2 16:56:13 File transfer completed successfully.
2/2 16:56:14 Starting a VANILLA universe job with ID: 2605.0
2/2 16:56:14 IWD: /home/condor/execute/dir_19292
2/2 16:56:14 Output file: /home/condor/execute/dir_19292/_condor_stdout_2605.0
2/2 16:56:14 Error file: /home/condor/execute/dir_19292/_condor_stderr_2605.0
2/2 16:56:15 Renice expr "19" evaluated to 19
2/2 16:56:15 About to exec /home/condor/execute/dir_19292/condor_exec.exe
2/2 16:56:15 Create_Process succeeded, pid=19294
The application runs for quite a while before the problems happen. Am I
missing something obvious?
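
For anyone wanting to line up the two logs mechanically rather than by eye, here is a minimal Python sketch. It assumes only the line formats quoted above (the "Starter pid N died on signal M" line from the StartLog and the "** PID = N" banner line from the StarterLog). The function names are made up for illustration, and the year is a placeholder because the log stamps do not carry one.

```python
import re
import signal
from datetime import datetime

# Patterns are taken from the log lines quoted in this thread only.
DEATH_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Starter pid (\d+) died on signal (\d+)"
)
PID_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \*\* PID = (\d+)")


def parse_ts(stamp):
    # The stamps have no year; a fixed placeholder year is assumed so that
    # two stamps can be subtracted from each other.
    return datetime.strptime("2005 " + stamp, "%Y %m/%d %H:%M:%S")


def signal_name(signo):
    # Translate a signal number to its symbolic name where the platform knows it.
    try:
        return signal.Signals(signo).name
    except ValueError:
        return str(signo)


def starter_startups(starterlog_lines):
    """Map starter PID -> startup time, read from the banner's PID line.

    If a PID is reused, the most recent startup wins.
    """
    starts = {}
    for line in starterlog_lines:
        m = PID_RE.match(line)
        if m:
            starts[int(m.group(2))] = parse_ts(m.group(1))
    return starts


def starter_lifetimes(startlog_lines, starterlog_lines):
    """Return (pid, signal name, time alive or None) per starter death."""
    starts = starter_startups(starterlog_lines)
    results = []
    for line in startlog_lines:
        m = DEATH_RE.match(line)
        if not m:
            continue
        died_at = parse_ts(m.group(1))
        pid, signo = int(m.group(2)), int(m.group(3))
        started = starts.get(pid)
        alive = died_at - started if started is not None else None
        results.append((pid, signal_name(signo), alive))
    return results
```

Fed the lines quoted above, this pairs the death of starter 19086 at 16:56:10 with its startup at 16:38:36, i.e. the starter had been up for 17 minutes 34 seconds when it received the signal.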
Are there any core files in the condor log directory on the execute
machine?
No, nothing. The machines I'm seeing this on are using up most of their
physical memory with these jobs, and are possibly swapping. However, I
don't see why that should cause this problem (these boxes are RH 7.2 and
SuSE 9.0). A RHE box which has memory to spare doesn't show any such
problems. BTW, I'm using the dynamic RH9 Condor build for the SuSE 9.0
platforms; is this what is recommended? You might be interested to know
that when I tried the static 6.6.8 build on SuSE 9.0, the Shadow dies
with a signal 11, but that's another story!
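
On the missing core files: one possible explanation (an assumption, nothing in the logs above confirms it) is that the core-size limit in effect is zero, in which case the kernel writes no core file at all. A minimal Python sketch of that check follows. It reports the limit of the process running the script, which is not necessarily the limit the Condor daemons run under, and `log_dir` is a hypothetical parameter standing in for wherever the Condor logs live on the execute machine.

```python
import glob
import os
import resource


def core_dump_report(log_dir):
    """Report the core-size limit of this process and any core* files in log_dir."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        "soft_limit": soft,          # resource.RLIM_INFINITY means unlimited
        "hard_limit": hard,
        "cores_disabled": soft == 0,  # a zero soft limit suppresses core files
        "core_files": sorted(glob.glob(os.path.join(log_dir, "core*"))),
    }
```

If `cores_disabled` comes back true in the environment the daemons are started from, the absence of core files says nothing either way about where the crash happened.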
Thx for any help,
Mark