[HTCondor-users] Dynamic Slots in Parallel Universe
- Date: Wed, 22 Nov 2017 16:10:59 +0000
- From: Duncan Brown <dabrown@xxxxxxx>
- Subject: [HTCondor-users] Dynamic Slots in Parallel Universe
Hi Todd,
I'm having a problem with job claims not succeeding in the parallel universe with dynamic slots. With an 8.6.7 Schedd/Negotiator/Startd setup, I'm seeing pslot preemption work as I'd expect:
> 11/22/17 10:38:20 Phase 4.1: Negotiating with schedds ...
> 11/22/17 10:38:20 Negotiating with DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx at <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf>
> 11/22/17 10:38:20 0 seconds so far for this submitter
> 11/22/17 10:38:20 0 seconds so far for this schedd
> 11/22/17 10:38:20 Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20 Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting 1 dslots <10.5.182.235:9618?addrs=10.5.182.235-9618+[--1]-9618&noUDP&sock=2006_8712_3> slot1@CRUSH-SUGWG-OSG-10-5-182-235
> 11/22/17 10:38:20 Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-182-235
> 11/22/17 10:38:20 Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20 Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.183.41:9618?addrs=10.5.183.41-9618+[--1]-9618&noUDP&sock=1879_6a8b_3> slot1@CRUSH-SUGWG-OSG-10-5-183-41
> 11/22/17 10:38:20 Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-183-41
> 11/22/17 10:38:20 Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20 Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.182.218:9618?addrs=10.5.182.218-9618+[--1]-9618&noUDP&sock=2060_912f_3> slot1@CRUSH-SUGWG-OSG-10-5-182-218
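For context, pslot preemption is enabled on the negotiator in the usual way; I'm quoting our config from memory, so the exact form may differ slightly:

    # negotiator knob: allow dslots under a partitionable slot to be preempted as a group
    ALLOW_PSLOT_PREEMPTION = True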
And I'm seeing slots claimed by the dedicated scheduler:
> [dbrown@sugwg-condor ~]$ condor_userprio -all
> Last Priority Update: 11/22 11:00
> Effective Real Priority Res Total Usage Usage Last Time Since
> User Name Priority Priority Factor In Use (wghted-hrs) Start Time Usage Time Last Usage
> ------------------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx 4.38 4.38 1.00 184 418.79 11/02/2017 16:40 11/22/2017 10:38 0+00:21
But the claim never activates and the job doesn't start. On the schedd I see:
> 11/22/17 10:50:11 (pid:2474326) Request was NOT accepted for claim slot1@CRUSH-SUGWG-OSG-10-5-182-246 <10.5.182.246:9618?addrs=10.5.182.246-9618+[--1]-9618&noUDP&sock=1998_0a50_3> for DedicatedScheduler -1.-1
and on one of the startds I see:
> 11/22/17 10:50:11 slot1_1: New machine resource of type -1 allocated
> 11/22/17 10:50:11 slot1: Changing state: Owner -> Unclaimed
> 11/22/17 10:50:11 slot1: State change: IS_OWNER is TRUE
> 11/22/17 10:50:11 slot1: Changing state: Unclaimed -> Owner
> 11/22/17 10:50:11 Setting up slot pairings
> 11/22/17 10:50:11 slot1_1: Request accepted.
> 11/22/17 10:50:11 slot1_1: Remote owner is steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx
> 11/22/17 10:50:11 slot1_1: State change: claiming protocol successful
> 11/22/17 10:50:11 slot1_1: Changing state: Owner -> Claimed
> 11/22/17 10:50:11 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
> 11/22/17 10:50:11 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
> 11/22/17 10:50:11 slot1: Claiming protocol failed
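If I'm reading that right, the startd rounds the request via the MODIFY_REQUEST_EXPR_* knobs before trying to split the pslot. My understanding is the stock defaults are along these lines (quoting from memory, so apologies if I've misremembered them):

    # default-style quantization of dslot requests on the startd (from memory, may not match our config)
    MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {1})
    MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {128})
    MODIFY_REQUEST_EXPR_REQUESTDISK   = quantize(RequestDisk, {1024})

Could that rounding push the request past what's left in slot1?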
This is strange, as the node has been partitioned and the resources are available:
> [dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep TotalSlot
> TotalSlotCpus = 3
> TotalSlotDisk = 181808.0
> TotalSlotMemory = 35840
> TotalSlots = 2
>
> [dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep RemoteOwner
> RemoteOwner = "steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx"
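In case it's useful, the query I've been using to see what's left on the pslot itself is roughly this (standard attribute names, nothing custom):

    condor_status slot1@CRUSH-SUGWG-OSG-10-5-182-246 -af Cpus Memory Disk TotalSlotCpus TotalSlotMemory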
And these resources are enough to run the job:
> [dbrown@sugwg-condor ~]$ condor_q -better 3486245.0
>
>
> -- Schedd: sugwg-condor.phy.syr.edu : <10.5.2.4:9615?...
> The Requirements expression for job 3486245.000 is
>
> ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
>
> Job 3486245.000 defines the following attributes:
>
> RequestCpus = 3
> RequestDisk = 1
> RequestMemory = 35840
>
> The Requirements expression for job 3486245.000 reduces to these conditions:
>
> Slots
> Step Matched Condition
> ----- -------- ---------
> [0] 12897 TARGET.Arch == "X86_64"
> [1] 12897 TARGET.OpSys == "LINUX"
> [3] 12897 TARGET.Disk >= RequestDisk
> [5] 142 TARGET.Memory >= RequestMemory
>
> No successful match recorded.
> Last failed match: Wed Nov 22 11:05:03 2017
>
> Reason for last match failure: no match found
>
> 3486245.000: Run analysis summary ignoring user priority. Of 12897 machines,
> 12727 are rejected by your job's requirements
> 126 reject your job because of their own requirements
> 28 are exhausted partitionable slots
> 8 match and are already running your jobs
> 8 match but are serving other users
> 0 are available to run your job
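For reference, the relevant part of the submit description is reconstructed below; the executable and machine count are placeholders, but the request lines match the job above:

    # reconstructed submit description (executable and machine_count are placeholders)
    universe        = parallel
    executable      = /path/to/mpi_script
    machine_count   = 8
    request_cpus    = 3
    request_memory  = 35840
    request_disk    = 1
    queue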
After the claim timeout, it gives up the claim:
> 11/22/17 11:00:18 slot1_1: State change: received RELEASE_CLAIM command
> 11/22/17 11:00:18 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 11/22/17 11:00:18 slot1_1: State change: No preempting claim, returning to owner
> 11/22/17 11:00:18 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 11/22/17 11:00:18 slot1_1: Changing state: Owner -> Delete
> 11/22/17 11:00:18 slot1_1: Resource no longer needed, deleting
Then rinse and repeat every claim timeout (600 seconds).
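I'm assuming that 600-second cycle is just the dedicated scheduler giving up the unused claim, i.e. UNUSED_CLAIM_TIMEOUT, which I believe defaults to:

    # how long the dedicated scheduler holds a claim it isn't using before releasing it
    UNUSED_CLAIM_TIMEOUT = 600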
Any ideas?
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Physics Department
Charles Brightman Professor of Physics Syracuse University, NY 13244
http://dabrown.expressions.syr.edu Phone: 315 443 5993