[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] shadow exited



> > On Tue, Jun 15, 2004 at 03:54:06PM -0500, Scott Koranda wrote:
> > > Hi,
> > > 
> > > I found this in the SchedLog:
> > > 
> > > 6/15 12:48:11 Shadow pid 25284 for job 584394.0 exited with status 107
> > > 6/15 12:48:11 Called send_vacate( <129.89.200.23:41351>, 443 )
> > > 6/15 12:48:11 Sent RELEASE_CLAIM to startd on <129.89.200.23:41351>
> > > 6/15 12:48:11 Match record (<129.89.200.23:41351>, 584394, 0) deleted
> > > 6/15 12:48:11 Capability of deleted match: <129.89.200.23:41351>#1298467496
> > > 6/15 12:48:11 Entered delete_shadow_rec( 25284 ) 6/15 12:48:11 Deleting shadow rec for PID 25284, job (584394.0)
> > > 6/15 12:48:11 Entered check_zombie( 25284, 0x92af31c, st=2 )
> > > 6/15 12:48:11 Marked job 584394.0 as IDLE
> > > 
> > > Why might have the shadow for job 584394.0 exited with status
> > > 107?
> > > 
> > 
> > We want the job to go back into the queue. It's normal.  
> > 
> > Shadow exit codes should be considered a black box - we don't
> > follow UNIX conventions of exiting with '0' for success and
> > '1' for failure - we've got about 10 or so exit codes, all of
> > which are for success.
> > 
> > Is there something else going that you're worried about?
> > 
> 
> The job seems to be matched, it starts running, and then a
> short time later it is forcibly evicted and goes back into the
> queue. 
> 
> This is happening every few minutes and I am trying to
> understand why so I am poking around in the logs.
> 
> The job is part of a DAG and there are a lot of jobs logging
> to the same file, but here are the appropriate parts of the
> log file for this job:
> 
> 000 (584394.000.000) 06/15 12:43:40 Job submitted from host: <129.89.201.233:32786>
> 001 (584394.000.000) 06/15 12:45:31 Job executing on host: <129.89.200.23:41351>004 (584394.000.000) 06/15 12:48:11 Job was evicted.
> 001 (584394.000.000) 06/15 12:52:31 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 12:55:17 Job was evicted.
> 001 (584394.000.000) 06/15 12:59:59 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 13:02:24 Job was evicted.
> 001 (584394.000.000) 06/15 13:06:48 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 13:09:30 Job was evicted.
> 001 (584394.000.000) 06/15 13:13:51 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 13:16:36 Job was evicted.
> 001 (584394.000.000) 06/15 13:20:56 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 13:23:42 Job was evicted.
> 001 (584394.000.000) 06/15 13:28:16 Job executing on host: <129.89.200.12:32773>004 (584394.000.000) 06/15 13:30:48 Job was evicted.

<snip>

I should all that on machine 129.89.200.12 in the StartLog I
see many cycles for job  584394.0 that look like this:

6/15 12:59:57 Remote job ID is 584394.0
6/15 12:59:57 exec_starter( medusa-master003.medusa.phys.uwm.edu, 10, 11 ) : pid 3981
6/15 12:59:57 execl(/opt/condor/sbin/condor_starter.std, "condor_starter", medusa-master003.medusa.phys.uwm.edu, 0)
6/15 12:59:57 Got RemoteUser (duncan@xxxxxxxxxxxxxxxxxxx) from request classad
6/15 12:59:57 Got universe "STANDARD" (1) from request classad
6/15 12:59:57 State change: claim-activation protocol successful
6/15 12:59:57 Changing activity: Idle -> Busy
6/15 13:00:01 Attempting to send update via UDP to collector condor.medusa.phys.uwm.edu <129.89.201.238:9618>
6/15 13:00:01 Sent update to 1 collector(s)
6/15 13:02:03 DaemonCore: Command received via TCP from host <129.89.201.233:48680>
6/15 13:02:03 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/15 13:02:03 Called deactivate_claim_forcibly()
6/15 13:02:03 In Starter::kill() with pid 3981, sig 3 (SIGQUIT)
6/15 13:02:03 DaemonCore: No more children processes to reap.
6/15 13:02:03 Starter pid 3981 exited with status 0
6/15 13:02:03 Canceled hardkill-starter timer (1747)
6/15 13:02:03 ProcAPI::buildFamily failed: parent 3981 not found on system.
6/15 13:02:03 ProcAPI: pid 3981 does not exist.
6/15 13:02:03 ProcAPI: pid 3984 does not exist.
6/15 13:02:03 cleanup_execute_dir(): /opt/condor/home/execute/dir_3981 does not exist
6/15 13:02:03 State change: starter exited
6/15 13:02:03 Changing activity: Busy -> Idle

Further, this is Condor 6.6.1.

Thanks,

Scott