Re: [Condor-users] scheduling problem?
- Date: Wed, 24 May 2006 11:56:39 +0100
- From: "Kewley, J (John)" <j.kewley@xxxxxxxx>
- Subject: Re: [Condor-users] scheduling problem?
For status codes, you can also see:
http://www.cs.wisc.edu/~adesmet/status.html
In this case, the shadow exit status probably isn't too helpful:
"100 JOB_EXITED The job exited (not killed)"
JK
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of David McBride
> Sent: Wednesday, May 24, 2006 11:27 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] scheduling problem?
>
>
> On Wed, 2006-05-24 at 08:05 +0000, John Coulthard wrote:
>
> > It would be great if someone could tell me what's happened but,
> > failing that, is there a list where I can look up what
> > "died due to signal #" and "EXITING WITH STATUS ###" mean?
>
> Processes can be sent signals. A list of all the signals that exist
> (at least on this Linux box I'm working on) can be obtained via
> `man 7 signal`.
>
> Some signals will cause a process to terminate. When processes
> terminate (either Condor daemons, or a user's condor job), they can
> return a numeric status value. Zero, by convention, means "I ran
> successfully"; non-zero indicates that an error occurred.
>
> (The meaning of different return codes is application-specific.)
>
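To make the difference between "exited with a status" and "died due to a signal" concrete, here is a minimal Python sketch. It is not Condor code, and `describe_exit` is a hypothetical helper name. It relies on the documented `subprocess` convention that, on POSIX, a negative return code -N means the child was terminated by signal N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"died due to signal {signum} ({name})"
    if returncode == 0:
        return "exited with status 0 (success by convention)"
    return f"exited with status {returncode} (meaning is application-specific)"


if __name__ == "__main__":
    # A child that exits normally with a non-zero status.
    exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(describe_exit(exited.returncode))

    # A child that kills itself with SIGTERM.
    killed = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"]
    )
    print(describe_exit(killed.returncode))
```

Signal numbers differ between platforms, so the name lookup is only meaningful on the machine where the process actually ran.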
> > The MasterLog (end of)
> > 5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25
>
> I'm guessing that you're running on a BSD-derived operating system
> (e.g. Mac OS X). Signal 25 on BSD 4.2 machines is described as follows:
>
>     SIGXFSZ   25,25,31   Core   File size limit exceeded (4.2BSD)
>
> It looks like the SCHEDD exceeded some hard-coded file size limit in
> the operating system, possibly in its history or log files. As a
> result, the OS sent a SIGXFSZ (#25) signal to it, which killed it.
>
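For anyone who wants to see a SIGXFSZ death first-hand, the following is a rough, POSIX-only Python sketch. It uses the per-process file size limit (`RLIMIT_FSIZE`, the same limit `ulimit -f` sets) because that is easy to reproduce; whether the schedd hit this limit or some other OS limit is not known from the logs, so treat it as an illustration of the mechanism only. The function name and the file name `growing.log` are made up for the example.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Script run in a child process, so the lowered limit never affects the caller.
CHILD = r"""
import os, resource, signal, sys
# Python ignores SIGXFSZ at startup; restore the default (fatal) action.
signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
# Allow this process to create files of at most 1024 bytes.
resource.setrlimit(resource.RLIMIT_FSIZE, (1024, 1024))
fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
while True:
    os.write(fd, b"x" * 512)  # the write past 1024 bytes triggers SIGXFSZ
"""


def run_until_file_limit():
    """Run the child until the OS kills it; return the subprocess return code."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "growing.log")
        result = subprocess.run([sys.executable, "-c", CHILD, target])
        return result.returncode


if __name__ == "__main__":
    code = run_until_file_limit()
    # A negative return code -N means "terminated by signal N".
    print("child return code:", code)
    print("SIGXFSZ on this machine is signal", int(signal.SIGXFSZ))
```

A program that ignores or handles SIGXFSZ gets an EFBIG error from the write instead of being killed, which is why the child has to restore the default action first.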
> > The ShadowLog (end of)
> > 5/24 06:44:28 (84901.58) (31039): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
> > 5/24 06:44:31 getpeername failed so connect must have failed
> > 5/24 06:49:29 Connect failed for 300 seconds; returning FALSE
> > 5/24 06:49:29 Can't connect to queue manager
> > CEDAR:6001:Failed to connect to <192.168.0.40:52226>
> > 5/24 06:49:29 ERROR "Failed to connect to schedd!" at line 102 in file shadow_initializer.C
>
> Someone more familiar with Condor can tell you what return code 100
> indicates, but the error "Failed to connect to schedd!" is a bit of a
> giveaway. It's failing because it can't talk to the local Schedd
> (probably because the OS killed it!)
>
> > The StartLog (end of)
>
> This looks fairly normal.
>
> So, in short, it looks like your root problem is that the SCHEDD on
> your job-submission host is keeling over and dying. I'd have a look
> around to see if any of the files it normally uses have gotten very
> large (e.g. over 2GB in size).
>
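A quick way to do that check is a small script along these lines. This is only a sketch: `find_large_files` is a hypothetical helper, and the directory to scan is whatever you pass on the command line (the paths printed by `condor_config_val LOG` and `condor_config_val SPOOL` would be the obvious candidates to try).

```python
import os
import stat
import sys

TWO_GIB = 2 * 1024 ** 3  # the 2GB threshold suggested above


def find_large_files(root, threshold=TWO_GIB):
    """Return (size, path) pairs for regular files under root whose size
    is at least threshold bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size >= threshold:
                hits.append((info.st_size, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Usage: python find_large.py /path/to/condor/log/dir
    for size, path in find_large_files(sys.argv[1]):
        print(f"{size:>15,d}  {path}")
```

Files sitting just under the threshold are worth a look too, so it can help to rerun with a smaller `threshold` value.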
> Hope this helps.
>
> Cheers,
> David
> --
> David McBride <dwm@xxxxxxxxxxxx>
> Department of Computing, Imperial College, London
>