On Wed, 2006-05-24 at 08:05 +0000, John Coulthard wrote:
> It would be great if someone could
> tell me what's happened but failing that is there a list where I can lookup
> what "died due to signal #" and "EXITING WITH STATUS ###" mean?

Processes can be sent signals. A list of all the signals that exist (at
least on this Linux box I'm working on) can be obtained via `man 7 signal`.
Some signals will cause a process to terminate.

When processes terminate (either Condor daemons, or a user's Condor job),
they can return a numeric status value. Zero, by convention, means "I ran
successfully"; non-zero indicates that an error occurred. (The meaning of
different return codes is application-specific.)

> The MarterLog (end of)
> 5/24 06:08:08 The SCHEDD (pid 30865) died due to signal 25

I'm guessing that you're running on a BSD-derived operating system (e.g.
MacOS X). Signal 25, which dates from 4.2 BSD, is described as follows:

    SIGXFSZ   25,25,31    Core    File size limit exceeded (4.2 BSD)

It looks like the SCHEDD exceeded some file size limit enforced by the
operating system, possibly in its history or log files. As a result, the OS
sent a SIGXFSZ (#25) signal to it, which killed it.

> The ShadowLog (end of)
> 5/24 06:44:28 (84901.58) (31039): **** condor_shadow (condor_SHADOW) EXITING
> WITH STATUS 100
> 5/24 06:44:31 getpeername failed so connect must have failed
> 5/24 06:49:29 Connect failed for 300 seconds; returning FALSE
> 5/24 06:49:29 Can't connect to queue manager
> CEDAR:6001:Failed to connect to <192.168.0.40:52226>
> 5/24 06:49:29 ERROR "Failed to connect to schedd!" at line 102 in file
> shadow_initializer.C

Someone more familiar with Condor can tell you what return code 100
indicates, but the error "Failed to connect to schedd!" is a bit of a
giveaway. It's failing because it can't talk to the local Schedd (probably
because the OS killed it!)

> The StartLog (end of)

This looks fairly normal.
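
For looking up what a given signal number or exit status means without reading through the man page, a small Python sketch along these lines may help. It is only an illustration: the function names are made up for this example, and it relies on the convention of Python's `subprocess` module on POSIX systems that a negative return code means the child was killed by that signal. Signal numbering differs between platforms, so it should be run on the same kind of machine that produced the log.

```python
import signal


def describe_signal(signum):
    """Return this platform's symbolic name for a signal number, or None."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return None


def describe_returncode(returncode):
    """Describe a subprocess-style return code (negative = killed by signal)."""
    if returncode < 0:
        signum = -int(returncode)
        name = describe_signal(signum) or "unknown signal"
        return f"died due to signal {signum} ({name})"
    if returncode == 0:
        return "exited with status 0 (success, by convention)"
    # The meaning of a non-zero status is defined by the application itself.
    return f"exited with status {int(returncode)} (application-specific error)"


if __name__ == "__main__":
    # Look up the number seen in the log on the current platform.
    print(describe_signal(25))
    print(describe_returncode(100))
```

The lookup only tells you the name of the signal; what a non-zero exit status such as 100 means still has to come from the application's own documentation or source.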
So, in short, it looks like your root problem is that the SCHEDD on your
job-submission host is keeling over and dying. I'd have a look around to
see if any of the files it normally uses have gotten very large (e.g. >2GB
in size).

Hope this helps.

Cheers,
David
-- 
David McBride <dwm@xxxxxxxxxxxx>
Department of Computing, Imperial College, London
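
One way to carry out the check for very large files is sketched below in Python. This is a rough illustration under assumptions: the function name and the directory path are placeholders (point it at wherever the schedd keeps its log, history and spool files on your host), and the 2GB default simply mirrors the figure mentioned above.

```python
import os


def find_large_files(root, threshold_bytes=2 * 1024**3):
    """Yield (path, size) for files under root whose size is >= threshold_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if size >= threshold_bytes:
                yield path, size


if __name__ == "__main__":
    # Placeholder path: substitute the directory you want to inspect.
    for path, size in find_large_files("/path/to/condor/log"):
        print(f"{size:>15d}  {path}")
```

Lowering `threshold_bytes` (say, to 1GB) also shows files that are approaching the limit rather than only those already at it.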