Re: [Condor-users] Jobs die with signal 11
- Date: Tue, 28 Sep 2004 09:15:25 -0500
- From: Nick LeRoy <nleroy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Jobs die with signal 11
On Tue September 28 2004 12:25 pm, mcal00@xxxxxxxxxxxxx wrote:
> Hi, we've got an old linux cluster (i286 processors running RH7.2) that
> we've converted into a Condor pool and we constantly see jobs dying with
> Shadow exceptions, with the only clue in the StarterLog files being of the
> form (Condor v. 6.6.6 all round):
>
> 9/27 14:12:33 vm2: Got activate_claim request from shadow
> (<172.24.116.193:42835>)
> 9/27 14:12:33 vm2: Remote job ID is 352.0
> 9/27 14:12:33 vm2: Got universe "VANILLA" (5) from request classad
> 9/27 14:12:33 vm2: State change: claim-activation protocol successful
> 9/27 14:12:33 vm2: Changing activity: Idle -> Busy
> 9/27 14:21:22 Starter pid 21927 died on signal 11 (signal 11)
> 9/27 14:21:22 vm2: State change: starter exited
> 9/27 14:21:22 vm2: Changing activity: Busy -> Idle
>
> What's that signal 11 mean? I notice that someone spotted something similar
> under solaris last year (message 476), and Erik Paulson suggested that it
> may have been a bug. Was it ever resolved?
Signal 11 means "segmentation fault" (SEGV); i.e., the program crashed. Most
likely, this is due to a buggy application being started by Condor.
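As a quick sanity check of what the number means, the shell can translate it, and the 128+N exit-status convention lets you spot a signal death from a job's exit code (a sketch; the status arithmetic is standard shell behavior, not something specific to Condor):

```shell
# Ask the shell which signal number 11 is:
kill -l 11
# prints: SEGV

# Shells report "killed by signal N" as exit status 128+N, so a
# process that exits with status 139 was killed by signal 11.
sh -c 'kill -SEGV $$'    # stand-in for a crashing job
echo "exit status: $?"   # prints: exit status: 139
```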
> These jobs are coming in from flocked pools across the campus, so the
> network they have to traverse is slightly unfriendlier than your average
> LAN. Could such a signal be due to a network glitch?
I really can't see how network topology could cause a job to get a SEGV,
except in some unusual circumstances.
One thing to try would be to run the job's executable directly on a machine
that you've seen it crash on (directly as in without Condor involved at all).
If it runs like that without crashing, then there's some unintended
interaction with Condor; but, most likely, you'll see the same crash. In the
above log, the job had been running for ~9 minutes before the starter died, so
you shouldn't have to wait very long.
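A sketch of such a reproduction run, with core dumps enabled so a debugger can show where the crash happened. The segfaulting `sh -c` line is just a self-contained stand-in; you'd substitute the actual job executable and its input:

```shell
# Allow a core file to be written on crash.
ulimit -c unlimited

# Stand-in for running the job binary directly, no Condor involved.
# Replace with the real executable and its input files.
sh -c 'kill -SEGV $$'
echo "exit status: $?"   # 139 = 128 + 11, i.e. killed by SIGSEGV

# With a real binary, load the resulting core file into a debugger
# to see the crash site, e.g.:  gdb ./the_job core
```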
There is one other thing that I can think of looking into: environment
variables. User jobs may get started with a very different set of
environment variables than the one they were submitted with, which could
also cause a buggy application to crash.
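One way to probe for this kind of problem is to run the program under a stripped-down environment, roughly approximating what a batch-started job might see (a sketch using `env -i`; a Condor-started job's actual environment will differ, but the point is that variables your login shell takes for granted may be absent):

```shell
# Run with an empty environment; variables like HOME, LD_LIBRARY_PATH,
# etc. that exist in your login shell are gone here.
env -i sh -c 'echo "HOME is: ${HOME:-unset}"'
# prints: HOME is: unset

# A binary that blindly dereferences the result of getenv() can
# segfault when one of those variables is missing.
```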
Hope this helps
-Nick
--
<<< There is no spoon. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences