Hello,

I have a large number of single-core jobs to run, each with a different memory request. As preliminary work, I set up HTCondor on a single machine using a dynamic (partitionable) slot setup. My goal is simple: keep the cluster fully utilized.

Initially, it seems that all memory of the target machine is allocated. However, after several rounds, the unclaimed slot1 holds more and more memory, and only a few jobs run in parallel while many jobs sit idle in the queue, as shown below. The memory requests of those jobs range from 500 MB to 1600 MB, so I am pretty sure the cluster should be running more jobs.

    $ condor_status
    Name                 OpSys  Arch    State      Activity  LoadAv    Mem  ActvtyTime

    slot1@DummyServer    LINUX  X86_64  Unclaimed  Idle       0.000  84510  0+00:28:17
    slot1_1@DummyServer  LINUX  X86_64  Claimed    Busy       0.000   1000  0+00:00:01
    slot1_2@DummyServer  LINUX  X86_64  Claimed    Busy       0.000   1000  0+00:00:01
    slot1_3@DummyServer  LINUX  X86_64  Claimed    Busy       0.020   1000  0+00:00:04
    slot1_4@DummyServer  LINUX  X86_64  Claimed    Busy       0.010   1000  0+00:00:03
    slot1_5@DummyServer  LINUX  X86_64  Claimed    Busy       0.020   1000  0+00:00:05
    slot1_6@DummyServer  LINUX  X86_64  Claimed    Busy       0.010   1000  0+00:00:03
    slot1_7@DummyServer  LINUX  X86_64  Claimed    Busy       0.020   1000  0+00:00:02
    slot1_8@DummyServer  LINUX  X86_64  Claimed    Busy       0.000   1000  0+00:00:18

                  Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain
    X86_64/LINUX      9      0        8          1        0           0         0      0
           Total      9      0        8          1        0           0         0      0

    $ condor_q
    OWNER  BATCH_NAME  SUBMITTED  DONE  RUN  IDLE  TOTAL  JOB_IDS
    ... ...

    Total for query: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended
    Total for all users: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended

I haven't done any fancy setup in condor_config.local. I've tried both of the following:

    DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%
    SLOT_TYPE_1_PARTITIONABLE = true

and

    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%
    SLOT_TYPE_1_PARTITIONABLE = true
    CLAIM_WORKLIFE = 0

Cluster utilization is similarly low in both setups. With CLAIM_WORKLIFE = 0, I expected that after each job completes, the corresponding claimed slot would be returned to the original unclaimed slot1, so that when a new job shows up, a new dynamic slot is created with a matching memory allocation. Again, my workload is mixed, and I don't think a fixed number of static slots can meet my requirements.

Here is my sample submission file:

    executable = xxx.sh
    should_transfer_files = NO
    request_cpus = 1
    request_memory = 1600
    log = xxx.log
    output = xxx.txt
    Queue

And here is my condor version:

    $CondorVersion: 8.8.3 May 26 2019 BuildID: 470254 $
    $CondorPlatform: x86_64_Ubuntu18 $

Any comments and suggestions are appreciated.

Best,
Shunxing
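
P.S. To make the "should be running more jobs" claim concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers from the condor_status output above and assumes memory is the only limiting resource; the number of cores on the machine is not taken into account, so the real ceiling may be lower. The function name extra_jobs is just an illustrative helper, not anything from HTCondor.

```python
# Figures copied from the condor_status output in this post.
unclaimed_mb = 84510           # Mem still held by the partitionable slot1
largest_request_mb = 1600      # upper end of my jobs' memory requests
smallest_request_mb = 500      # lower end of my jobs' memory requests


def extra_jobs(free_mb: int, request_mb: int) -> int:
    """Number of additional jobs of one size that fit if memory is the only limit."""
    return free_mb // request_mb


# Assumption: CPU count and any other resource limits are ignored here.
low = extra_jobs(unclaimed_mb, largest_request_mb)
high = extra_jobs(unclaimed_mb, smallest_request_mb)
print(f"By memory alone, {low} to {high} more jobs would fit in slot1")
```

So even if every idle job asked for the full 1600 MB, the leftover memory in slot1 would be enough for many more than the 8 jobs currently running.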