[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Condor-users] Schedd sending startddeactivate_claim_forciblycommand?



Opps. That's my fault for not checking that carefully enough. Although
based on your pointing this out you can ignore my statement that this is
only occuring when jobs are Retiring/Preempting. We seem to have a
schedd issuing DEACTIVATE_CLAIM_FORCIBILY on running jobs as well then.

Ian 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: November 2, 2004 3:31 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Schedd sending startd 
> deactivate_claim_forciblycommand?
> 
> Ian,
> 
> In the log output you provided, I don't see any transition to 
> retirement or preemption due to any competing claim.  All I 
> see is a job running
> (Claimed/Busy) and then DEACTIVATE_CLAIM_FORCIBLY causing the 
> claim to be closed.  Do you believe this is true in general 
> or did you just post a log snippet that is not representative 
> of the problem you are seeing?
> 
> --Dan
> 
> Ian Chesal wrote:
> 
> >I'm encountering a mysterious situation with our cluster. Jobs are 
> >starting, entering the retirement phase because of machine 
> rank based 
> >on queue times, and then not being allowed the full 
> >MaxJobRetirementTime to finish -- instead the scheduler from 
> which all 
> >the jobs orginated is sending this "command 404 
> >(DEACTIVATE_CLAIM_FORCIBLY)" to the startd running the job 
> and it's being killed off.
> > 
> >On my startd machine I have the following log output (see below).
> >Following the vm2 starter you can see it gets a claim, 
> starts running 
> >it, moves it to retirement/preemption because of rank and 
> then gets a 
> >forcible deactivate command from the schedd machine (137.57.142.51). 
> >Why is the schedd calling this command on the startd? Why is in not 
> >retiring the job and allowing it to continue through the 
> >MaxJobRetirementTime (which is set to 2 days)?
> > 
> >We are a little baffled at how we got into this mess. With our RANK 
> >based on queue time we're seeing jobs just trashing in our system.
> > 
> >Thanks in advance for any insight you can offer!
> > 
> >Ian
> > 
> >11/2 14:16:20 DaemonCore: Command received via UDP from host 
> ><137.57.176.9:35511>
> >11/2 14:16:20 DaemonCore: received command 440 (MATCH_INFO), calling 
> >handler (command_match_info)
> >11/2 14:16:20 vm2: match_info called
> >11/2 14:16:20 vm2: Received match 
> <137.57.176.186:2147>#1099203682#159
> >11/2 14:16:20 vm2: State change: match notification protocol 
> successful
> >11/2 14:16:20 vm2: Changing state: Unclaimed -> Matched
> >11/2 14:16:20 DaemonCore: Command received via TCP from host 
> ><137.57.142.51:3349>
> >11/2 14:16:20 DaemonCore: received command 442 
> (REQUEST_CLAIM), calling 
> >handler (command_request_claim)
> >11/2 14:16:20 vm2: Request accepted.
> >11/2 14:16:20 vm2: Remote owner is gquan@xxxxxxxxxx
> >11/2 14:16:20 vm2: State change: claiming protocol successful
> >11/2 14:16:20 vm2: Changing state: Matched -> Claimed
> >11/2 14:16:27 DaemonCore: Command received via TCP from host 
> ><137.57.142.51:3365>
> >11/2 14:16:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), 
> >calling handler (command_activate_claim)
> >11/2 14:16:27 vm2: Got activate_claim request from shadow
> >(<137.57.142.51:3365>)
> >11/2 14:16:27 vm2: Remote job ID is 3.3
> >11/2 14:16:27 vm2: Got universe "VANILLA" (5) from request classad
> >11/2 14:16:27 vm2: State change: claim-activation protocol successful
> >11/2 14:16:27 vm2: Changing activity: Idle -> Busy
> >11/2 14:21:23 DaemonCore: Command received via TCP from host 
> ><137.57.176.186:2309>
> >11/2 14:21:23 DaemonCore: received command 448 (GIVE_STATE), calling 
> >handler (command_give_state)
> >11/2 14:32:03 DaemonCore: Command received via TCP from host 
> ><137.57.142.51:3699>
> >11/2 14:32:03 DaemonCore: received command 404 
> >(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> >11/2 14:32:03 vm1: Called deactivate_claim_forcibly()
> >11/2 14:32:03 DaemonCore: Command received via UDP from host 
> ><137.57.142.51:3702>
> >11/2 14:32:03 DaemonCore: received command 443 
> (RELEASE_CLAIM), calling 
> >handler (command_release_claim)
> >11/2 14:32:03 vm1: State change: received RELEASE_CLAIM command
> >11/2 14:32:03 vm1: Changing state and activity: Claimed/Busy -> 
> >Preempting/Vacating
> >11/2 14:32:03 DaemonCore: Command received via UDP from host 
> ><137.57.142.51:3703>
> >11/2 14:32:03 DaemonCore: received command 443 
> (RELEASE_CLAIM), calling 
> >handler (command_release_claim)
> >11/2 14:32:03 vm1: Got RELEASE_CLAIM while in Preempting state, 
> >ignoring.
> >11/2 14:32:04 DaemonCore: Command received via UDP from host 
> ><137.57.176.186:2333>
> >11/2 14:32:04 DaemonCore: received command 60001 (DC_PROCESSEXIT), 
> >calling handler (HandleProcessExitCommand())
> >11/2 14:32:04 Starter pid 1896 exited with status 0
> >11/2 14:32:13 vm1: State change: starter exited
> >11/2 14:32:13 vm1: State change: No preempting claim, returning to 
> >owner
> >11/2 14:32:13 vm1: Changing state and activity: 
> Preempting/Vacating -> 
> >Owner/Idle
> >11/2 14:32:13 vm1: State change: IS_OWNER is false
> >11/2 14:32:13 vm1: Changing state: Owner -> Unclaimed
> >11/2 14:32:47 DaemonCore: Command received via TCP from host 
> ><137.57.142.51:3726>
> >11/2 14:32:47 DaemonCore: received command 404 
> >(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> >11/2 14:32:47 vm2: Called deactivate_claim_forcibly()
> >11/2 14:32:47 DaemonCore: Command received via UDP from host 
> ><137.57.176.186:2343>
> >11/2 14:32:47 DaemonCore: received command 60001 (DC_PROCESSEXIT), 
> >calling handler (HandleProcessExitCommand())
> >11/2 14:32:47 Starter pid 1024 exited with status 0
> >11/2 14:32:54 vm2: State change: starter exited
> >11/2 14:32:58 vm2: Changing activity: Busy -> Idle
> >11/2 14:32:58 DaemonCore: Command received via UDP from host 
> ><137.57.142.51:3729>
> >11/2 14:32:58 DaemonCore: received command 443 
> (RELEASE_CLAIM), calling 
> >handler (command_release_claim)
> >11/2 14:32:58 vm2: State change: received RELEASE_CLAIM command
> >11/2 14:32:58 vm2: Changing state and activity: Claimed/Idle -> 
> >Preempting/Vacating
> >11/2 14:32:58 vm2: State change: No preempting claim, returning to 
> >owner
> >11/2 14:32:58 vm2: Changing state and activity: 
> Preempting/Vacating -> 
> >Owner/Idle
> >11/2 14:32:58 vm2: State change: IS_OWNER is false
> >11/2 14:32:58 vm2: Changing state: Owner -> Unclaimed
> >11/2 14:32:58 DaemonCore: Command received via UDP from host 
> ><137.57.142.51:3730>
> >11/2 14:32:58 DaemonCore: received command 443 
> (RELEASE_CLAIM), calling 
> >handler (command_release_claim)
> >11/2 14:32:58 Error: can't find resource with ClaimId
> >(<137.57.176.186:2147>#1099203682#159)
> >
> >
> >_______________________________________________
> >Condor-users mailing list
> >Condor-users@xxxxxxxxxxx
> >http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >  
> >
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>