Hi,

alongside our large pool, we currently operate a much smaller pool with a single DedicatedScheduler. Most of the time, jobs get scheduled nicely, but sometimes jobs stall in the idle state even though AllRemoteHosts already contains the target slots.

Some/most of the StartLog entries look fine:

09/30/20 12:55:52 slot1_1: State change: IS_OWNER is false
09/30/20 12:55:52 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 12:55:52 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 12:55:52 slot1_1: Resource no longer needed, deleting
09/30/20 13:01:09 slot1_1: New machine resource of type -1 allocated
09/30/20 13:01:09 Setting up slot pairings
09/30/20 13:01:09 slot1_1: Request accepted.
09/30/20 13:01:09 slot1_1: Remote owner is USER
09/30/20 13:01:09 slot1_1: State change: claiming protocol successful
09/30/20 13:01:09 slot1_1: Changing state: Owner -> Claimed

but others show problems (10.10.74.9 is this execute node, 10.20.30.23 is the schedd, 10.20.60.53 is the central manager):

[...]
09/30/20 12:53:32 Starter pid 74793 exited with status 0
09/30/20 12:53:32 slot1_1: State change: starter exited
09/30/20 12:53:32 slot1_1: State change: No preempting claim, returning to owner
09/30/20 12:53:32 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
09/30/20 12:53:32 slot1_1: State change: IS_OWNER is false
09/30/20 12:53:32 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 12:53:32 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 12:53:32 slot1_1: Resource no longer needed, deleting
09/30/20 12:55:42 slot1_1: New machine resource of type -1 allocated
09/30/20 12:55:42 Setting up slot pairings
09/30/20 12:55:42 slot1_1: Request accepted.
09/30/20 12:55:42 slot1_1: Remote owner is USER
09/30/20 12:55:42 slot1_1: State change: claiming protocol successful
09/30/20 12:55:42 slot1_1: Changing state: Owner -> Claimed
09/30/20 12:55:52 Error: can't find resource with ClaimId (<10.10.74.9:9618?addrs=10.10.74.9-9618&noUDP&sock=56893_71be_3>#1600368105#136#...) for 444 (ACTIVATE_CLAIM)
09/30/20 12:55:52 Error: can't find resource with ClaimId (<10.10.74.9:9618?addrs=10.10.74.9-9618&noUDP&sock=56893_71be_3>#1600368105#136#...) -- perhaps this claim was already removed?
09/30/20 12:55:52 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
09/30/20 12:55:52 Can't read ClaimId
09/30/20 12:55:52 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:31387>, fd is 11
09/30/20 12:55:52 Buf::write(): condor_write() failed
09/30/20 13:05:54 Can't read ClaimId
09/30/20 13:15:42 slot1_1: State change: claim no longer recognized by the schedd - removing claim
09/30/20 13:15:42 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/30/20 13:15:42 slot1_1: State change: No preempting claim, returning to owner
09/30/20 13:15:42 slot1_1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/30/20 13:15:42 slot1_1: State change: IS_OWNER is false
09/30/20 13:15:42 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 13:15:42 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 13:15:42 slot1_1: Resource no longer needed, deleting
09/30/20 13:16:22 slot1_1: New machine resource of type -1 allocated
09/30/20 13:16:22 Setting up slot pairings
09/30/20 13:16:22 slot1_1: Request accepted.
09/30/20 13:16:22 slot1_1: Remote owner is USER
09/30/20 13:16:22 slot1_1: State change: claiming protocol successful
09/30/20 13:16:22 slot1_1: Changing state: Owner -> Claimed
09/30/20 13:16:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
09/30/20 13:16:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
09/30/20 13:16:22 slot1: State change: claiming protocol failed
09/30/20 13:16:22 slot1: Changing state: Unclaimed -> Owner
09/30/20 13:16:22 slot1: State change: IS_OWNER is false
09/30/20 13:16:22 slot1: Changing state: Owner -> Unclaimed
09/30/20 13:21:33 slot1_1: Got activate_claim request from shadow (10.20.30.23)
09/30/20 13:21:33 slot1_1: Remote job ID is 960.0
09/30/20 13:21:33 slot1_1: Got universe "PARALLEL" (11) from request classad
09/30/20 13:21:33 slot1_1: State change: claim-activation protocol successful
09/30/20 13:21:33 slot1_1: Changing activity: Idle -> Busy

and then finally the job starts after ~half an hour in idle (claim?) mode.

Any idea what the hold-up may be?

Cheers and thanks in advance for any help/suggestions!

Carsten

PS: Perhaps important: In order to not waste the cycles of this small pool, we allow flocking from the main pool to this one along with a hopefully suitable preemption policy:

MAIN1/2/3: FQDN of main pool schedd machines
DSCHED: FQDN of the dedicated scheduler

excerpt from condor_config_val -sum:

FLOCK_FROM = MAIN1 MAIN2 MAIN3
ALLOW_NEGOTIATOR_SCHEDD = $(FLOCK_FROM),$(CONDOR_HOST)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
PREEMPTION_REQUIREMENTS = (MY.ClientMachine != 'DSCHED' && TARGET.ClientMachine == 'DSCHED')

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
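
The gap between a successful claim and the actual job start can be read off StartLog lines like the ones quoted above. The following is a minimal Python sketch, assuming the "MM/DD/YY HH:MM:SS" timestamp format and the two state-change messages shown in the excerpt. The function name claim_to_busy_delays and the sample list are illustrative only and not part of HTCondor.

```python
from datetime import datetime

# Message fragments taken from the StartLog excerpt above.
CLAIMED = "Changing state: Owner -> Claimed"
BUSY = "Changing activity: Idle -> Busy"


def claim_to_busy_delays(lines, slot="slot1_1"):
    """Return the seconds between each claim of `slot` and its next activation.

    Only the most recent claim before an activation is counted, so claims
    that were dropped and re-issued in between are ignored.
    """
    delays = []
    claimed_at = None
    for line in lines:
        parts = line.split(None, 2)  # date, time, rest of the message
        if len(parts) < 3 or not parts[2].startswith(slot + ":"):
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line
        if CLAIMED in parts[2]:
            claimed_at = stamp
        elif BUSY in parts[2] and claimed_at is not None:
            delays.append((stamp - claimed_at).total_seconds())
            claimed_at = None
    return delays


# Three lines copied from the excerpt above.
sample = [
    "09/30/20 13:16:22 slot1_1: Changing state: Owner -> Claimed",
    "09/30/20 13:21:33 slot1_1: Got activate_claim request from shadow (10.20.30.23)",
    "09/30/20 13:21:33 slot1_1: Changing activity: Idle -> Busy",
]
print(claim_to_busy_delays(sample))  # [311.0]
```

Run over a whole StartLog, this only measures the last claim before each activation. The earlier claims that were removed again (12:55:42 and 13:01:09 in the excerpt) would need to be tracked separately to account for the full half hour.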