
[HTCondor-users] Dynamic Slots in Parallel Universe



Hi Todd,

I'm having a problem with job claims not succeeding in the parallel universe with dynamic slots. With an 8.6.7 Schedd/Negotiator/Startd setup, I'm seeing pslot preemption work as I'd expect:

> 11/22/17 10:38:20 Phase 4.1:  Negotiating with schedds ...
> 11/22/17 10:38:20   Negotiating with DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx at <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf>
> 11/22/17 10:38:20 0 seconds so far for this submitter
> 11/22/17 10:38:20 0 seconds so far for this schedd
> 11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting 1 dslots <10.5.182.235:9618?addrs=10.5.182.235-9618+[--1]-9618&noUDP&sock=2006_8712_3> slot1@CRUSH-SUGWG-OSG-10-5-182-235
> 11/22/17 10:38:20       Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-182-235
> 11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.183.41:9618?addrs=10.5.183.41-9618+[--1]-9618&noUDP&sock=1879_6a8b_3> slot1@CRUSH-SUGWG-OSG-10-5-183-41
> 11/22/17 10:38:20       Successfully matched with slot1@CRUSH-SUGWG-OSG-10-5-183-41
> 11/22/17 10:38:20     Request 3486245.00000: autocluster -1 (request count 1 of 0)
> 11/22/17 10:38:20       Matched 3486245.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <128.230.190.43:9615?PrivAddr=%3c10.5.2.4:9615%3fsock%3d2474326_26bf%3e&PrivNet=SU_ITS&addrs=128.230.190.43-9615+[--1]-9615&noUDP&sock=2474326_26bf> preempting none <10.5.182.218:9618?addrs=10.5.182.218-9618+[--1]-9618&noUDP&sock=2060_912f_3> slot1@CRUSH-SUGWG-OSG-10-5-182-218

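(For reference, pslot preemption is enabled on the negotiator side; if I have the knob name right, that's just

   ALLOW_PSLOT_PREEMPTION = True

in the negotiator's config.)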

And I'm seeing slots claimed by the dedicated scheduler:

> [dbrown@sugwg-condor ~]$ condor_userprio -all
> Last Priority Update: 11/22 11:00
>                                              Effective     Real   Priority   Res   Total Usage       Usage             Last       Time Since
> User Name                                     Priority   Priority  Factor   In Use (wghted-hrs)    Start Time       Usage Time    Last Usage
> ------------------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx         4.38     4.38      1.00    184       418.79 11/02/2017 16:40 11/22/2017 10:38    0+00:21

But the claim is failing to start. On the schedd I see:

> 11/22/17 10:50:11 (pid:2474326) Request was NOT accepted for claim slot1@CRUSH-SUGWG-OSG-10-5-182-246 <10.5.182.246:9618?addrs=10.5.182.246-9618+[--1]-9618&noUDP&sock=1998_0a50_3> for DedicatedScheduler -1.-1

and on one of the startds I see:

> 11/22/17 10:50:11 slot1_1: New machine resource of type -1 allocated
> 11/22/17 10:50:11 slot1: Changing state: Owner -> Unclaimed
> 11/22/17 10:50:11 slot1: State change: IS_OWNER is TRUE
> 11/22/17 10:50:11 slot1: Changing state: Unclaimed -> Owner
> 11/22/17 10:50:11 Setting up slot pairings
> 11/22/17 10:50:11 slot1_1: Request accepted.
> 11/22/17 10:50:11 slot1_1: Remote owner is steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx
> 11/22/17 10:50:11 slot1_1: State change: claiming protocol successful
> 11/22/17 10:50:11 slot1_1: Changing state: Owner -> Claimed
> 11/22/17 10:50:11 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
> 11/22/17 10:50:11 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
> 11/22/17 10:50:11 slot1: Claiming protocol failed

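I take the MODIFY_REQUEST_EXPR_ line to refer to the startd's request-quantization knobs. As I read the stock defaults, those are:

   MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {1})
   MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {128})
   MODIFY_REQUEST_EXPR_REQUESTDISK   = quantize(RequestDisk, {1024})

If that's what's being applied here, it shouldn't push the request past the slot's size: 35840 is already an exact multiple of 128, and rounding RequestDisk = 1 up to 1024 KiB is still tiny next to the slot's disk.
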
This is strange, as the node has been partitioned and the resources are available:

> [dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep TotalSlot
> TotalSlotCpus = 3
> TotalSlotDisk = 181808.0
> TotalSlotMemory = 35840
> TotalSlots = 2
> 

> [dbrown@sugwg-condor ~]$ condor_status -long slot1_1@CRUSH-SUGWG-OSG-10-5-182-246 | grep RemoteOwner
> RemoteOwner = "steven.reyes@xxxxxxxxxxxxxxxxxxxxxxxxx"

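(The parent pslot can be queried the same way to see what's left over after the split, e.g.

   condor_status -long slot1@CRUSH-SUGWG-OSG-10-5-182-246 | grep -E '^(Cpus|Memory|Disk) '

for the unallocated Cpus/Memory/Disk.)
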
And these resources are enough to run the job:

> [dbrown@sugwg-condor ~]$ condor_q -better 3486245.0
> 
> 
> -- Schedd: sugwg-condor.phy.syr.edu : <10.5.2.4:9615?...
> The Requirements expression for job 3486245.000 is
> 
>     ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
> 
> Job 3486245.000 defines the following attributes:
> 
>     RequestCpus = 3
>     RequestDisk = 1
>     RequestMemory = 35840
> 
> The Requirements expression for job 3486245.000 reduces to these conditions:
> 
>          Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]       12897  TARGET.Arch == "X86_64"
> [1]       12897  TARGET.OpSys == "LINUX"
> [3]       12897  TARGET.Disk >= RequestDisk
> [5]         142  TARGET.Memory >= RequestMemory
> 
> No successful match recorded.
> Last failed match: Wed Nov 22 11:05:03 2017
> 
> Reason for last match failure: no match found 
> 
> 3486245.000:  Run analysis summary ignoring user priority.  Of 12897 machines,
>   12727 are rejected by your job's requirements 
>     126 reject your job because of their own requirements 
>      28 are exhausted partitionable slots 
>       8 match and are already running your jobs 
>       8 match but are serving other users 
>       0 are available to run your job


After the claim timeout, it gives up the claim:

> 11/22/17 11:00:18 slot1_1: State change: received RELEASE_CLAIM command
> 11/22/17 11:00:18 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 11/22/17 11:00:18 slot1_1: State change: No preempting claim, returning to owner
> 11/22/17 11:00:18 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 11/22/17 11:00:18 slot1_1: Changing state: Owner -> Delete
> 11/22/17 11:00:18 slot1_1: Resource no longer needed, deleting

Then rinse and repeat every claim timeout (600 seconds).
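
(If it matters, I take the 600 seconds to be the dedicated scheduler's unused claim timeout, i.e. the default

   UNUSED_CLAIM_TIMEOUT = 600

which we haven't overridden as far as I know.)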

Any ideas?

Cheers,
Duncan.

-- 

Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993