
[HTCondor-users] Problems mixing IsDesktop and Whole Machine slots



I'm currently using a setup similar to the one outlined here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots

 

The non-whole-machine slot is a partitionable slot for smaller single- or multi-core jobs that don't require all of the machine's resources to run. I have a mixture of IsDesktop and non-IsDesktop machines in our HTCondor pool. The idea is to have, in addition to normal workstations, a dedicated subset of computers (rackmount computers) that cannot be suspended due to owner activity.

To date this has worked really well except in one condition. When the partitionable slot on a non-IsDesktop machine (IsDesktop =?= False) is busy and a whole-machine job is added to the queue, the partitionable slot does not suspend/vacate its jobs. In effect, HTCondor causes resource conflicts by allowing both slot types to run jobs at the same time on server-only machines. I'm sure that a tweak to the logic of the SUSPEND expression would fix this problem, but I'm not sure of the cleanest way to modify my config without breaking other working features. Attached below is the local config file for the server nodes. I appreciate any help you all can provide.

 

 

Michael McInerny Murphy

Engineer

IERUS Technologies, Inc.

2904 Westcorp Blvd., Suite 210

(256) 319-2026 x 107

DAEMON_LIST = MASTER, STARTD

NETWORK_INTERFACE=eth0

# GPU features
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra

# This machine is a server with preemption conditionally disabled
IsDesktop = False
STARTD_ATTRS = $(STARTD_ATTRS) IsDesktop

# we will double-allocate resources to overlapping slots
COUNT_HYPERTHREAD_CPUS = True
NUM_CPUS = $(DETECTED_CORES)*2
MEMORY = $(DETECTED_MEMORY)*2

# partitionable slot gets all cores and RAM (but no GPUs) and is
# carved up among single- and multi-core jobs
SLOT_TYPE_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=0
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1

# whole-machine slot gets as many cores and RAM as the machine has
WHOLE_MACHINE_SLOT = 2
SLOT_TYPE_2 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
SLOT_TYPE_2_PARTITIONABLE = False
NUM_SLOTS_TYPE_2 = 1

# ClassAd attribute that is True/False depending on whether this slot is
# the whole-machine slot
CAN_RUN_WHOLE_MACHINE = SlotID == $(WHOLE_MACHINE_SLOT)
STARTD_EXPRS = $(STARTD_EXPRS) CAN_RUN_WHOLE_MACHINE

# advertise state of each slot as SlotX_State in ClassAds of all other slots
STARTD_SLOT_ATTRS = $(STARTD_SLOT_ATTRS) State

# Macro for referencing state of the whole-machine slot.
# Relies on eval(), which was added in HTCondor 7.3.2.
WHOLE_MACHINE_SLOT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_State"))

# Macro that is true if the partitionable slot is claimed
# WARNING: THERE MUST BE AN ENTRY FOR ALL SLOTS
# IN THE EXPRESSION BELOW.  If you have more slots, you must
# extend this expression to cover them.  If you have fewer
# slots, extra entries are harmless.
PARTITIONABLE_SLOT_CLAIMED = \
  ($(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed") < \
  (Slot1_State =?= "Claimed")

# Non-whole-machine jobs must run on the partitionable slot
START_PARTITIONABLE_SLOT_JOB = \
  TARGET.RequiresWholeMachine =!= True && MY.CAN_RUN_WHOLE_MACHINE == False && \
  $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"

# Whole-machine jobs must run on the whole-machine slot
START_WHOLE_MACHINE_JOB = \
  TARGET.RequiresWholeMachine =?= True && MY.CAN_RUN_WHOLE_MACHINE

START = ($(START)) && ( \
  ($(START_PARTITIONABLE_SLOT_JOB)) || \
  ($(START_WHOLE_MACHINE_JOB)) )

# Suspend the whole-machine job until single-core jobs finish.
#SUSPEND = ($(SUSPEND)) || ( \
#  MY.CAN_RUN_WHOLE_MACHINE && ($(PARTITIONABLE_SLOT_CLAIMED)) )

# Suspend single-core jobs while the whole-machine job runs
SUSPEND = ($(SUSPEND)) || ( \
    MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

CONTINUE = ($(SUSPEND)) =!= True

WANT_SUSPEND = ($(WANT_SUSPEND)) || ($(SUSPEND))
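
To make the intent of the added START and SUSPEND terms easier to follow, here is a minimal Python sketch of their logic. It is only a model, not HTCondor code: the function names (meta_eq, meta_ne, extra_start, extra_suspend) are hypothetical, an UNDEFINED ClassAd value is represented as None, and the inherited $(START) and $(SUSPEND) policy is not modelled at all. It also assumes the slot evaluating the expression sees the same SlotID and Slot2_State values as the config expects, which may not match what a running startd actually advertises.

```python
# Model of the START/SUSPEND terms added in the config above.
# Assumption: ClassAd UNDEFINED is represented as None.

def meta_eq(a, b):
    """ClassAd =?= : True only if both sides have the same type and value."""
    return type(a) is type(b) and a == b


def meta_ne(a, b):
    """ClassAd =!= : the negation of =?= (never UNDEFINED)."""
    return not meta_eq(a, b)


def extra_start(slot_id, requires_whole_machine, whole_slot_state,
                whole_machine_slot=2):
    """START_PARTITIONABLE_SLOT_JOB || START_WHOLE_MACHINE_JOB."""
    can_run_whole = slot_id == whole_machine_slot
    start_partitionable = (
        meta_ne(requires_whole_machine, True)
        and can_run_whole == False
        and meta_ne(whole_slot_state, "Claimed")
    )
    start_whole = meta_eq(requires_whole_machine, True) and can_run_whole
    return start_partitionable or start_whole


def extra_suspend(slot_id, whole_slot_state, whole_machine_slot=2):
    """The term OR-ed into SUSPEND for slots other than the whole-machine slot."""
    can_run_whole = slot_id == whole_machine_slot
    return meta_ne(can_run_whole, True) and meta_eq(whole_slot_state, "Claimed")


if __name__ == "__main__":
    # Tabulate the suspend term for both slots and a few whole-machine slot states.
    for slot_id in (1, 2):
        for state in ("Unclaimed", "Claimed", None):
            print(slot_id, state, extra_suspend(slot_id, state))
```

According to this model, the added SUSPEND term is true for slot 1 whenever the whole-machine slot's state is "Claimed". If the jobs still keep running in that situation, the cause presumably lies outside these terms, for example in the inherited policy or in the attribute values the slot actually sees.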