I'm encountering a mysterious situation with our cluster. Jobs are starting, entering the retirement phase because of machine rank based on queue times, and then not being allowed the full MaxJobRetirementTime to finish -- instead the scheduler from which all the jobs orginated is sending this "command 404 (DEACTIVATE_CLAIM_FORCIBLY)" to the startd running the job and it's being killed off.
On my startd machine I have the following log output (see below). Following the vm2 starter you can see it gets a claim, starts running it, moves it to retirement/preemption because of rank and then gets a forcible deactivate command from the schedd machine (137.57.142.51). Why is the schedd calling this command on the startd? Why is in not retiring the job and allowing it to continue through the MaxJobRetirementTime (which is set to 2 days)?
We are a little baffled at how we got into this mess. With our RANK based on queue time we're seeing jobs just trashing in our system.
Thanks in advance for any insight you can offer!
Ian
11/2 14:16:20 DaemonCore: Command received via UDP from host <137.57.176.9:35511> 11/2 14:16:20 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info) 11/2 14:16:20 vm2: match_info called 11/2 14:16:20 vm2: Received match <137.57.176.186:2147>#1099203682#159 11/2 14:16:20 vm2: State change: match notification protocol successful 11/2 14:16:20 vm2: Changing state: Unclaimed -> Matched 11/2 14:16:20 DaemonCore: Command received via TCP from host <137.57.142.51:3349> 11/2 14:16:20 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim) 11/2 14:16:20 vm2: Request accepted. 11/2 14:16:20 vm2: Remote owner is gquan@xxxxxxxxxx 11/2 14:16:20 vm2: State change: claiming protocol successful 11/2 14:16:20 vm2: Changing state: Matched -> Claimed 11/2 14:16:27 DaemonCore: Command received via TCP from host <137.57.142.51:3365> 11/2 14:16:27 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim) 11/2 14:16:27 vm2: Got activate_claim request from shadow (<137.57.142.51:3365>) 11/2 14:16:27 vm2: Remote job ID is 3.3 11/2 14:16:27 vm2: Got universe "VANILLA" (5) from request classad 11/2 14:16:27 vm2: State change: claim-activation protocol successful 11/2 14:16:27 vm2: Changing activity: Idle -> Busy 11/2 14:21:23 DaemonCore: Command received via TCP from host <137.57.176.186:2309> 11/2 14:21:23 DaemonCore: received command 448 (GIVE_STATE), calling handler (command_give_state) 11/2 14:32:03 DaemonCore: Command received via TCP from host <137.57.142.51:3699> 11/2 14:32:03 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler) 11/2 14:32:03 vm1: Called deactivate_claim_forcibly() 11/2 14:32:03 DaemonCore: Command received via UDP from host <137.57.142.51:3702> 11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim) 11/2 14:32:03 vm1: State change: received RELEASE_CLAIM command 11/2 14:32:03 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating 11/2 14:32:03 DaemonCore: Command received via UDP from host <137.57.142.51:3703> 11/2 14:32:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim) 11/2 14:32:03 vm1: Got RELEASE_CLAIM while in Preempting state, ignoring. 11/2 14:32:04 DaemonCore: Command received via UDP from host <137.57.176.186:2333> 11/2 14:32:04 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand()) 11/2 14:32:04 Starter pid 1896 exited with status 0 11/2 14:32:13 vm1: State change: starter exited 11/2 14:32:13 vm1: State change: No preempting claim, returning to owner 11/2 14:32:13 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle 11/2 14:32:13 vm1: State change: IS_OWNER is false 11/2 14:32:13 vm1: Changing state: Owner -> Unclaimed 11/2 14:32:47 DaemonCore: Command received via TCP from host <137.57.142.51:3726> 11/2 14:32:47 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler) 11/2 14:32:47 vm2: Called deactivate_claim_forcibly() 11/2 14:32:47 DaemonCore: Command received via UDP from host <137.57.176.186:2343> 11/2 14:32:47 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand()) 11/2 14:32:47 Starter pid 1024 exited with status 0 11/2 14:32:54 vm2: State change: starter exited 11/2 14:32:58 vm2: Changing activity: Busy -> Idle 11/2 14:32:58 DaemonCore: Command received via UDP from host <137.57.142.51:3729> 11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim) 11/2 14:32:58 vm2: State change: received RELEASE_CLAIM command 11/2 14:32:58 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating 11/2 14:32:58 vm2: State change: No preempting claim, returning to owner 11/2 14:32:58 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle 11/2 14:32:58 vm2: State change: IS_OWNER is false 11/2 14:32:58 vm2: Changing state: Owner -> Unclaimed 11/2 14:32:58 DaemonCore: Command received via UDP from host <137.57.142.51:3730> 11/2 14:32:58 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim) 11/2 14:32:58 Error: can't find resource with ClaimId (<137.57.176.186:2147>#1099203682#159)
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users