
[HTCondor-users] Trying to understand DedicatedScheduler related problems



Hi,

Alongside our large pool, we currently operate a much smaller pool with
a single DedicatedScheduler. Most of the time, jobs get scheduled
nicely, but sometimes jobs stall in the idle state, even though
AllRemoteHosts already lists the target slots.

Some/Most of the StartLog entries look fine:

09/30/20 12:55:52 slot1_1: State change: IS_OWNER is false
09/30/20 12:55:52 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 12:55:52 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 12:55:52 slot1_1: Resource no longer needed, deleting
09/30/20 13:01:09 slot1_1: New machine resource of type -1 allocated
09/30/20 13:01:09 Setting up slot pairings
09/30/20 13:01:09 slot1_1: Request accepted.
09/30/20 13:01:09 slot1_1: Remote owner is USER
09/30/20 13:01:09 slot1_1: State change: claiming protocol successful
09/30/20 13:01:09 slot1_1: Changing state: Owner -> Claimed

but others complain about problems (10.10.74.9 is this execute node,
10.20.30.23 is the schedd, 10.20.60.53 is the central manager):

[...]
09/30/20 12:53:32 Starter pid 74793 exited with status 0
09/30/20 12:53:32 slot1_1: State change: starter exited
09/30/20 12:53:32 slot1_1: State change: No preempting claim, returning to owner
09/30/20 12:53:32 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
09/30/20 12:53:32 slot1_1: State change: IS_OWNER is false
09/30/20 12:53:32 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 12:53:32 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 12:53:32 slot1_1: Resource no longer needed, deleting
09/30/20 12:55:42 slot1_1: New machine resource of type -1 allocated
09/30/20 12:55:42 Setting up slot pairings
09/30/20 12:55:42 slot1_1: Request accepted.
09/30/20 12:55:42 slot1_1: Remote owner is USER
09/30/20 12:55:42 slot1_1: State change: claiming protocol successful
09/30/20 12:55:42 slot1_1: Changing state: Owner -> Claimed
09/30/20 12:55:52 Error: can't find resource with ClaimId (<10.10.74.9:9618?addrs=10.10.74.9-9618&noUDP&sock=56893_71be_3>#1600368105#136#...) for 444 (ACTIVATE_CLAIM)
09/30/20 12:55:52 Error: can't find resource with ClaimId (<10.10.74.9:9618?addrs=10.10.74.9-9618&noUDP&sock=56893_71be_3>#1600368105#136#...) -- perhaps this claim was already removed?
09/30/20 12:55:52 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
09/30/20 12:55:52 Can't read ClaimId
09/30/20 12:55:52 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:31387>, fd is 11
09/30/20 12:55:52 Buf::write(): condor_write() failed
09/30/20 13:05:54 Can't read ClaimId
09/30/20 13:15:42 slot1_1: State change: claim no longer recognized by the schedd - removing claim
09/30/20 13:15:42 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/30/20 13:15:42 slot1_1: State change: No preempting claim, returning to owner
09/30/20 13:15:42 slot1_1: Changing state and activity: Preempting/Killing -> Owner/Idle
09/30/20 13:15:42 slot1_1: State change: IS_OWNER is false
09/30/20 13:15:42 slot1_1: Changing state: Owner -> Unclaimed
09/30/20 13:15:42 slot1_1: Changing state: Unclaimed -> Delete
09/30/20 13:15:42 slot1_1: Resource no longer needed, deleting
09/30/20 13:16:22 slot1_1: New machine resource of type -1 allocated
09/30/20 13:16:22 Setting up slot pairings
09/30/20 13:16:22 slot1_1: Request accepted.
09/30/20 13:16:22 slot1_1: Remote owner is USER
09/30/20 13:16:22 slot1_1: State change: claiming protocol successful
09/30/20 13:16:22 slot1_1: Changing state: Owner -> Claimed
09/30/20 13:16:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
09/30/20 13:16:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
09/30/20 13:16:22 slot1: State change: claiming protocol failed
09/30/20 13:16:22 slot1: Changing state: Unclaimed -> Owner
09/30/20 13:16:22 slot1: State change: IS_OWNER is false
09/30/20 13:16:22 slot1: Changing state: Owner -> Unclaimed
09/30/20 13:21:33 slot1_1: Got activate_claim request from shadow (10.20.30.23)
09/30/20 13:21:33 slot1_1: Remote job ID is 960.0
09/30/20 13:21:33 slot1_1: Got universe "PARALLEL" (11) from request classad
09/30/20 13:21:33 slot1_1: State change: claim-activation protocol successful
09/30/20 13:21:33 slot1_1: Changing activity: Idle -> Busy

Finally, after about half an hour in the idle (claimed?) state, the job starts.
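
As a rough cross-check of that delay, the timestamps can be pulled out of
the StartLog lines quoted above. This is only a minimal Python sketch: the
helper name idle_claim_delays and the embedded LOG excerpt are illustrative
and not part of HTCondor, and it assumes the delay should be counted from
the first "Owner -> Claimed" transition of slot1_1 (12:55:42) up to the
"Idle -> Busy" transition (13:21:33), ignoring the intermediate re-claim.

```python
from datetime import datetime

# Excerpt of the slot1_1 StartLog lines quoted above (wrapped lines rejoined).
LOG = """\
09/30/20 12:55:42 slot1_1: Changing state: Owner -> Claimed
09/30/20 13:15:42 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Killing
09/30/20 13:16:22 slot1_1: Changing state: Owner -> Claimed
09/30/20 13:21:33 slot1_1: Changing activity: Idle -> Busy
"""


def parse_ts(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a StartLog line."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def idle_claim_delays(lines, slot="slot1_1"):
    """Return the time from the first claim of `slot` to each activation.

    Assumption: a re-claim before the job actually starts does not reset
    the clock, because from the job's point of view it was idle throughout.
    """
    delays = []
    first_claim = None
    for raw in lines:
        line = raw.strip()
        if f" {slot}: " not in line:
            continue
        if line.endswith("-> Claimed") and first_claim is None:
            first_claim = parse_ts(line)
        elif line.endswith("Idle -> Busy") and first_claim is not None:
            delays.append(parse_ts(line) - first_claim)
            first_claim = None
    return delays


if __name__ == "__main__":
    for delay in idle_claim_delays(LOG.splitlines()):
        print(delay)  # prints 0:25:51 for the excerpt above
```

Run against the full StartLog instead of the excerpt, the same loop would
list every claim-to-activation gap for the slot, which may help to tell
whether the long waits line up with the "Can't read ClaimId" errors.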

Any idea what the hold-up may be?

Cheers and thanks in advance for any help/suggestions!

Carsten

PS: Perhaps important: to avoid wasting the cycles of this small
pool, we allow flocking from the main pool to this one, along with a
hopefully suitable preemption policy:

MAIN1/2/3: FQDN of main pool schedd machines
DSCHED: FQDN of the dedicated scheduler

excerpt from condor_config_val -sum:

FLOCK_FROM = MAIN1 MAIN2 MAIN3
ALLOW_NEGOTIATOR_SCHEDD = $(FLOCK_FROM),$(CONDOR_HOST)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
PREEMPTION_REQUIREMENTS = (MY.ClientMachine != 'DSCHED' &&
TARGET.ClientMachine == 'DSCHED')

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
