Re: [Condor-users] jobs are being killed after 30-45 minutes
- Date: Wed, 26 Jul 2006 10:24:04 -0500
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs are being killed after 30-45 minutes
On Wed, Jul 26, 2006 at 01:21:14PM +0100, Santanu Das wrote:
> 7/25 09:56:13 State change: claim-activation protocol successful
> 7/25 09:56:13 Changing activity: Idle -> Busy
> 7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)
> 7/25 10:07:16 State change: starter exited
> 7/25 10:07:16 Changing activity: Busy -> Idle
> 7/25 10:07:17 DaemonCore: Command received via TCP from host <172.24.116.151:9583>
> 7/25 10:07:17 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
> 7/25 10:07:17 Got activate_claim request from shadow (<172.24.116.151:9583>)
> 7/25 10:07:17 Remote job ID is 7773.0
> 7/25 10:07:17 Got universe "VANILLA" (5) from request classad
> 7/25 10:07:17 State change: claim-activation protocol successful
> 7/25 10:07:17 Changing activity: Idle -> Busy
>
>
> Is this due to "signal 11" issue - what does this actually mean?
>
It means there's a bug in Condor: on Linux, signal 11 is SIGSEGV (a
segmentation fault), so the starter itself crashed.
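As a quick check of what a signal number in a log line refers to, here is a
minimal Python sketch. The helper name describe_signal is made up for this
example and is not part of Condor; signal numbering is platform-specific, so
the result reflects the machine the script runs on.

```python
import signal


def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"unknown signal {number}"


# Signal number taken from the StartLog excerpt quoted above.
print(describe_signal(11))
```

On a typical Linux machine this should report SIGSEGV.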
If possible, could you upgrade to 6.8.0? We'd much rather see
whether the bug is still present in the current release; based on
the log file so far, I suspect it's something that has already been fixed.
If you can't, for now please set
STARTER_DEBUG = D_ALL
MAX_STARTER_LOG = 10000000
on all machines where your job may run, run a job,
and then send the StarterLog from the machine where the starter crashed
to condor-admin@xxxxxxxxxxx
-Erik
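
To find which machine's starter crashed, and when, one option is to scan each
StartLog for the "died on signal" line shown in the quoted excerpt. The sketch
below is only an illustration: the regular expression is modeled on that one
quoted line, other Condor versions may word the message differently, and the
function name find_starter_crashes is hypothetical.

```python
import re

# Pattern modeled on the single StartLog line quoted in this thread;
# this is an assumption about the format, not a documented Condor contract.
CRASH_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) "
    r"Starter pid (?P<pid>\d+) died on signal (?P<sig>\d+)"
)


def find_starter_crashes(lines):
    """Return (timestamp, pid, signal) tuples for starter crash lines."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match:
            crashes.append(
                (match.group("stamp"),
                 int(match.group("pid")),
                 int(match.group("sig")))
            )
    return crashes


# Sample lines copied from the log excerpt quoted above.
sample = [
    "7/25 09:56:13 Changing activity: Idle -> Busy",
    "7/25 10:07:16 Starter pid 29124 died on signal 11 (signal 11)",
    "7/25 10:07:16 State change: starter exited",
]

crashes = find_starter_crashes(sample)
print(crashes)
```

To run it against a real log, pass an open file object instead of the sample
list; the location of the log file depends on the local Condor configuration.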