Hi all,

we set up a small dedicated pool with 24 machines of type A and 21 machines of type B. A user now submits a parallel universe job, requesting to run on all 24 machines of type A by setting request_cpus to the maximum that type A supports (larger than what type B supports).

As the pool was idle, the negotiator quickly matched the 24 nodes to the job, which can be seen via

  $ condor_q 63.0 -af RemoteHosts | xargs -d , -n1 echo | grep -c slot
  24

So far, so good. All nodes partition off the subslot slot1_1, but for the next 10 minutes nothing really happens; the StartLog does not contain a hint, and StarterLog.slot1_1 contains nothing at all (as nothing was actually started yet). After 10 minutes the claim is apparently removed on the startd, and a few minutes later the negotiator tries to match the same resources again. However, the subslot is still present and won't be preempted [1], and from then on the job simply stays idle.

Some data points:

$ condor_q -bet 63

-- Schedd: condor8.atlas.local : <10.20.30.23:2653>
The Requirements expression for job 63.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.Cpus >= RequestCpus) &&
    ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))

Job 63.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "atlas.local"
    RequestCpus = 128
    RequestDisk = DiskUsage
    RequestMemory = 40960

The Requirements expression for job 63.000 reduces to these conditions:

             Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]          90  TARGET.Arch == "X86_64"
    [1]          90  TARGET.OpSys == "LINUX"
    [3]          90  TARGET.Disk >= RequestDisk
    [5]          90  TARGET.Memory >= RequestMemory
    [7]          24  TARGET.Cpus >= RequestCpus
    [9]          90  TARGET.FileSystemDomain == MY.FileSystemDomain

Last successful match: Mon Jul 13 19:01:19 2020
Last failed match: Mon Jul 13 19:14:47 2020

Reason for last match failure: PREEMPTION_REQUIREMENTS == False

063.000:  Run analysis summary ignoring user priority.  Of 45 machines,
     21 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     24 are able to run your job
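For reference, the submit description is roughly of this form (a sketch, not the verbatim file; the executable path is a placeholder and machine_count is inferred from the 24 matched hosts above):

  # sketch of the parallel universe submit file
  universe       = parallel
  executable     = /path/to/parallel_wrapper
  machine_count  = 24
  request_cpus   = 128
  request_memory = 40960
  queue

With partitionable slots this should carve one 128-core dynamic slot (slot1_1) out of each type A node, which is what the StartLog below initially shows.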
------------------------------------------------
StartLog on one of the nodes:

07/13/20 19:01:14 slot1_1: New machine resource of type -1 allocated
07/13/20 19:01:14 Setting up slot pairings
07/13/20 19:01:14 slot1_1: Request accepted.
07/13/20 19:01:14 slot1_1: Remote owner is USER@xxxxxxxxxxx
07/13/20 19:01:14 slot1_1: State change: claiming protocol successful
07/13/20 19:01:14 slot1_1: Changing state: Owner -> Claimed
07/13/20 19:01:21 slot1_1: Called deactivate_claim()
07/13/20 19:01:21 Can't read ClaimId
07/13/20 19:01:21 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:16847>, fd is 11
07/13/20 19:01:21 Buf::write(): condor_write() failed
07/13/20 19:11:14 slot1_1: State change: claim no longer recognized by the schedd - removing claim
07/13/20 19:11:14 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Killing
07/13/20 19:11:14 slot1_1: State change: No preempting claim, returning to owner
07/13/20 19:11:14 slot1_1: Changing state and activity: Preempting/Killing -> Owner/Idle
07/13/20 19:11:14 slot1_1: State change: IS_OWNER is false
07/13/20 19:11:14 slot1_1: Changing state: Owner -> Unclaimed
07/13/20 19:11:14 slot1_1: Changing state: Unclaimed -> Delete
07/13/20 19:11:14 slot1_1: Resource no longer needed, deleting
07/13/20 19:14:46 slot1_1: New machine resource of type -1 allocated
07/13/20 19:14:46 Setting up slot pairings
07/13/20 19:14:47 slot1_1: Request accepted.
07/13/20 19:14:47 slot1_1: Remote owner is USER@xxxxxxxxxxx
07/13/20 19:14:47 slot1_1: State change: claiming protocol successful
07/13/20 19:14:47 slot1_1: Changing state: Owner -> Claimed
07/13/20 19:14:47 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
07/13/20 19:14:47 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
07/13/20 19:14:47 slot1: State change: claiming protocol failed
07/13/20 19:14:47 slot1: Changing state: Unclaimed -> Owner
07/13/20 19:14:47 slot1: State change: IS_OWNER is false
07/13/20 19:14:47 slot1: Changing state: Owner -> Unclaimed
-------------------------------------

Apart from the network communication glitch (which I see on various nodes at that time), there is nothing that really points to an immediate problem - at least not to me.

Does anyone have an idea what is going wrong here?

Cheers

Carsten

[1] Our preemption policy in this small pool is simple: only jobs whose JobUniverse is not 11 (parallel) may be preempted, which is not the case here, as the stale claim belongs to the parallel job itself.
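In config terms the policy is essentially this (paraphrased, not the verbatim negotiator config; it relies on the running job's JobUniverse being advertised in the claimed slot's ad, which the default STARTD_JOB_ATTRS should take care of):

  # only allow preemption of claims that are NOT parallel universe jobs
  PREEMPTION_REQUIREMENTS = (MY.JobUniverse =!= 11)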
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185