I'm currently using a setup similar to the one outlined here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots
The non-whole-machine slot is a partitionable slot for smaller single- or multi-core jobs that don't require all of the machine's resources to run. I have a mixture of IsDesktop and non-IsDesktop machines in our Condor pool. The idea is to have a dedicated subset of rackmount machines that cannot be suspended by owner activity, in addition to the normal workstations.

To date this has worked really well except in one case. When the partitionable slot on a non-IsDesktop machine (IsDesktop =?= False) is busy and a whole-machine job is added to the queue, the partitionable slot does not suspend/vacate its jobs. In effect, Condor creates resource conflicts by allowing both slot types to run at the same time on the server-only machines. I'm sure that a tweak to the logic of the SUSPEND expression would fix this problem, but I'm not sure of the cleanest way to modify my config without breaking the features that already work. Attached below is the local config file for the server nodes. I appreciate any help you all can provide.
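For context, the workstation (IsDesktop = True) nodes use a conventional owner-activity policy roughly along these lines; the exact expressions in that file differ, so treat this only as an illustration of the desktop/server split:

# Illustrative workstation fragment (IsDesktop = True nodes), not the exact
# config: jobs start only on an idle console and are suspended when the
# owner returns.
IsDesktop = True
STARTD_ATTRS = $(STARTD_ATTRS) IsDesktop
START    = KeyboardIdle > 900
SUSPEND  = KeyboardIdle < 60
CONTINUE = KeyboardIdle > 300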
Michael McInerny Murphy
Engineer
IERUS Technologies, Inc.
2904 Westcorp Blvd., Suite 210
(256) 319-2026 x 107
DAEMON_LIST = MASTER, STARTD
NETWORK_INTERFACE = eth0

# GPU features
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra

# This machine is a server with preemption conditionally disabled
IsDesktop = False
STARTD_ATTRS = $(STARTD_ATTRS) IsDesktop

# we will double-allocate resources to overlapping slots
COUNT_HYPERTHREAD_CPUS = True
NUM_CPUS = $(DETECTED_CORES)*2
MEMORY = $(DETECTED_MEMORY)*2

# single-core slots get 1 core each
SLOT_TYPE_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=0
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1

# whole-machine slot gets as many cores and RAM as the machine has
WHOLE_MACHINE_SLOT = 2
SLOT_TYPE_2 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
SLOT_TYPE_2_PARTITIONABLE = False
NUM_SLOTS_TYPE_2 = 1

# ClassAd attribute that is True/False depending on whether this slot is
# the whole-machine slot
CAN_RUN_WHOLE_MACHINE = SlotID == $(WHOLE_MACHINE_SLOT)
STARTD_EXPRS = $(STARTD_EXPRS) CAN_RUN_WHOLE_MACHINE

# advertise state of each slot as SlotX_State in ClassAds of all other slots
STARTD_SLOT_ATTRS = $(STARTD_SLOT_ATTRS) State

# Macro for referencing state of the whole-machine slot.
# Relies on eval(), which was added in HTCondor 7.3.2.
WHOLE_MACHINE_SLOT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_State"))

# Macro that is true if the partitionable slot is claimed
# WARNING: THERE MUST BE AN ENTRY FOR ALL SLOTS
# IN THE EXPRESSION BELOW.  If you have more slots, you must
# extend this expression to cover them.  If you have fewer
# slots, extra entries are harmless.
PARTITIONABLE_SLOT_CLAIMED = \
  ($(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed") < \
  (Slot1_State =?= "Claimed")

# Non-whole-machine jobs must run on the partitionable slot
START_PARTITIONABLE_SLOT_JOB = \
  TARGET.RequiresWholeMachine =!= True && MY.CAN_RUN_WHOLE_MACHINE == False && \
  $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"

# Whole-machine jobs must run on the whole-machine slot
START_WHOLE_MACHINE_JOB = \
  TARGET.RequiresWholeMachine =?= True && MY.CAN_RUN_WHOLE_MACHINE

START = ($(START)) && ( \
  ($(START_PARTITIONABLE_SLOT_JOB)) || \
  ($(START_WHOLE_MACHINE_JOB)) )

# Suspend the whole-machine job until single-core jobs finish.
#SUSPEND = ($(SUSPEND)) || ( \
#  MY.CAN_RUN_WHOLE_MACHINE && ($(PARTITIONABLE_SLOT_CLAIMED)) )

# Suspend single-core jobs while the whole-machine job runs
SUSPEND = ($(SUSPEND)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

CONTINUE = ($(SUSPEND)) =!= True

WANT_SUSPEND = ($(WANT_SUSPEND)) || ($(SUSPEND))
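For reference, the kind of SUSPEND tweak I have been considering looks roughly like the sketch below. It is untested, and I am not sure whether the dynamic slots created under the partitionable slot actually see Slot2_State in their ads, which is part of why I am asking.

# Untested sketch of the kind of tweak I mean: drop the dependence on the
# base $(SUSPEND) (which may never become True on these server-only nodes)
# so the whole-machine clause can fire on its own.  CONTINUE and
# WANT_SUSPEND would stay as in the attached file.
SUSPEND = MY.CAN_RUN_WHOLE_MACHINE =!= True && \
  $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed"

If it would help, I can also post condor_status -long output from one of these server nodes while the conflict is happening.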