Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: [Condor-users] Schedd sending startddeactivate_claim_forciblycommand?
- Date: Tue, 2 Nov 2004 15:34:22 -0500
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: RE: [Condor-users] Schedd sending startddeactivate_claim_forciblycommand?
Opps. That's my fault for not checking that carefully enough. Although
based on your pointing this out you can ignore my statement that this is
only occuring when jobs are Retiring/Preempting. We seem to have a
schedd issuing DEACTIVATE_CLAIM_FORCIBILY on running jobs as well then.
Ian
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: November 2, 2004 3:31 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Schedd sending startd
> deactivate_claim_forciblycommand?
>
> Ian,
>
> In the log output you provided, I don't see any transition to
> retirement or preemption due to any competing claim. All I
> see is a job running
> (Claimed/Busy) and then DEACTIVATE_CLAIM_FORCIBLY causing the
> claim to be closed. Do you believe this is true in general
> or did you just post a log snippet that is not representative
> of the problem you are seeing?
>
> --Dan
>
> Ian Chesal wrote:
>
> >I'm encountering a mysterious situation with our cluster. Jobs are
> >starting, entering the retirement phase because of machine
> rank based
> >on queue times, and then not being allowed the full
> >MaxJobRetirementTime to finish -- instead the scheduler from
> which all
> >the jobs orginated is sending this "command 404
> >(DEACTIVATE_CLAIM_FORCIBLY)" to the startd running the job
> and it's being killed off.
> >
> >On my startd machine I have the following log output (see below).
> >Following the vm2 starter you can see it gets a claim,
> starts running
> >it, moves it to retirement/preemption because of rank and
> then gets a
> >forcible deactivate command from the schedd machine (137.57.142.51).
> >Why is the schedd calling this command on the startd? Why is in not
> >retiring the job and allowing it to continue through the
> >MaxJobRetirementTime (which is set to 2 days)?
> >
> >We are a little baffled at how we got into this mess. With our RANK
> >based on queue time we're seeing jobs just trashing in our system.
> >
> >Thanks in advance for any insight you can offer!
> >
> >Ian
> >
> >11/2 14:16:20 DaemonCore: Command received via UDP from host
> ><137.57.176.9:35511>
> >11/2 14:16:20 DaemonCore: received command 440 (MATCH_INFO), calling
> >handler (command_match_info)
> >11/2 14:16:20 vm2: match_info called
> >11/2 14:16:20 vm2: Received match
> <137.57.176.186:2147>#1099203682#159
> >11/2 14:16:20 vm2: State change: match notification protocol
> successful
> >11/2 14:16:20 vm2: Changing state: Unclaimed -> Matched
> >11/2 14:16:20 DaemonCore: Command received via TCP from host
> ><137.57.142.51:3349>
> >11/2 14:16:20 DaemonCore: received command 442
> (REQUEST_CLAIM), calling
> >handler (command_request_claim)
> >11/2 14:16:20 vm2: Request accepted.
> >11/2 14:16:20 vm2: Remote owner is gquan@xxxxxxxxxx
> >11/2 14:16:20 vm2: State change: claiming protocol successful
> >11/2 14:16:20 vm2: Changing state: Matched -> Claimed
> >11/2 14:16:27 DaemonCore: Command received via TCP from host
> ><137.57.142.51:3365>
> >11/2 14:16:27 DaemonCore: received command 444 (ACTIVATE_CLAIM),
> >calling handler (command_activate_claim)
> >11/2 14:16:27 vm2: Got activate_claim request from shadow
> >(<137.57.142.51:3365>)
> >11/2 14:16:27 vm2: Remote job ID is 3.3
> >11/2 14:16:27 vm2: Got universe "VANILLA" (5) from request classad
> >11/2 14:16:27 vm2: State change: claim-activation protocol successful
> >11/2 14:16:27 vm2: Changing activity: Idle -> Busy
> >11/2 14:21:23 DaemonCore: Command received via TCP from host
> ><137.57.176.186:2309>
> >11/2 14:21:23 DaemonCore: received command 448 (GIVE_STATE), calling
> >handler (command_give_state)
> >11/2 14:32:03 DaemonCore: Command received via TCP from host
> ><137.57.142.51:3699>
> >11/2 14:32:03 DaemonCore: received command 404
> >(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> >11/2 14:32:03 vm1: Called deactivate_claim_forcibly()
> >11/2 14:32:03 DaemonCore: Command received via UDP from host
> ><137.57.142.51:3702>
> >11/2 14:32:03 DaemonCore: received command 443
> (RELEASE_CLAIM), calling
> >handler (command_release_claim)
> >11/2 14:32:03 vm1: State change: received RELEASE_CLAIM command
> >11/2 14:32:03 vm1: Changing state and activity: Claimed/Busy ->
> >Preempting/Vacating
> >11/2 14:32:03 DaemonCore: Command received via UDP from host
> ><137.57.142.51:3703>
> >11/2 14:32:03 DaemonCore: received command 443
> (RELEASE_CLAIM), calling
> >handler (command_release_claim)
> >11/2 14:32:03 vm1: Got RELEASE_CLAIM while in Preempting state,
> >ignoring.
> >11/2 14:32:04 DaemonCore: Command received via UDP from host
> ><137.57.176.186:2333>
> >11/2 14:32:04 DaemonCore: received command 60001 (DC_PROCESSEXIT),
> >calling handler (HandleProcessExitCommand())
> >11/2 14:32:04 Starter pid 1896 exited with status 0
> >11/2 14:32:13 vm1: State change: starter exited
> >11/2 14:32:13 vm1: State change: No preempting claim, returning to
> >owner
> >11/2 14:32:13 vm1: Changing state and activity:
> Preempting/Vacating ->
> >Owner/Idle
> >11/2 14:32:13 vm1: State change: IS_OWNER is false
> >11/2 14:32:13 vm1: Changing state: Owner -> Unclaimed
> >11/2 14:32:47 DaemonCore: Command received via TCP from host
> ><137.57.142.51:3726>
> >11/2 14:32:47 DaemonCore: received command 404
> >(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> >11/2 14:32:47 vm2: Called deactivate_claim_forcibly()
> >11/2 14:32:47 DaemonCore: Command received via UDP from host
> ><137.57.176.186:2343>
> >11/2 14:32:47 DaemonCore: received command 60001 (DC_PROCESSEXIT),
> >calling handler (HandleProcessExitCommand())
> >11/2 14:32:47 Starter pid 1024 exited with status 0
> >11/2 14:32:54 vm2: State change: starter exited
> >11/2 14:32:58 vm2: Changing activity: Busy -> Idle
> >11/2 14:32:58 DaemonCore: Command received via UDP from host
> ><137.57.142.51:3729>
> >11/2 14:32:58 DaemonCore: received command 443
> (RELEASE_CLAIM), calling
> >handler (command_release_claim)
> >11/2 14:32:58 vm2: State change: received RELEASE_CLAIM command
> >11/2 14:32:58 vm2: Changing state and activity: Claimed/Idle ->
> >Preempting/Vacating
> >11/2 14:32:58 vm2: State change: No preempting claim, returning to
> >owner
> >11/2 14:32:58 vm2: Changing state and activity:
> Preempting/Vacating ->
> >Owner/Idle
> >11/2 14:32:58 vm2: State change: IS_OWNER is false
> >11/2 14:32:58 vm2: Changing state: Owner -> Unclaimed
> >11/2 14:32:58 DaemonCore: Command received via UDP from host
> ><137.57.142.51:3730>
> >11/2 14:32:58 DaemonCore: received command 443
> (RELEASE_CLAIM), calling
> >handler (command_release_claim)
> >11/2 14:32:58 Error: can't find resource with ClaimId
> >(<137.57.176.186:2147>#1099203682#159)
> >
> >
> >_______________________________________________
> >Condor-users mailing list
> >Condor-users@xxxxxxxxxxx
> >http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> >
>
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>