[Condor-users] Partitionable Slot Starvation
- Date: Wed, 15 Aug 2012 17:49:23 -0400
- From: William Strecker-Kellogg <willsk@xxxxxxx>
- Subject: [Condor-users] Partitionable Slot Starvation
Hi all,
Is anyone out there using partitionable slots, and does anyone have
experience with the new condor_defrag daemon? I have set up a small test
cluster with condor-7.8.1 as follows:
1 CM running collector, negotiator, defrag
5 execute machines each running startd, schedd
Each execute machine is 24-core and configured thus:
SLOT_TYPE_1 = cpus=16, ram=2/3, swap=2/3, disk=2/3
SLOT_TYPE_2 = cpus=auto, ram=auto, swap=auto, disk=auto
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 8
so we have a combination of partitionable and regular slots.
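To make the split concrete, here is a quick sketch of how the resources divide per machine (the 48 GB RAM figure is a made-up example for illustration; the cpu split follows directly from the config above):

```python
# Per-machine resource split implied by the slot config above.
total_cpus = 24
ptn_cpus = 16                       # SLOT_TYPE_1: cpus=16 (partitionable)
static_slots = 8                    # NUM_SLOTS_TYPE_2

# cpus=auto divides what remains evenly: 8 cpus over 8 slots = 1 each
assert total_cpus - ptn_cpus == static_slots

total_ram_mb = 48 * 1024            # hypothetical 48 GB machine
ptn_ram_mb = total_ram_mb * 2 // 3  # ram=2/3 goes to the partitionable slot
static_ram_each = (total_ram_mb - ptn_ram_mb) // static_slots

print(ptn_cpus, ptn_ram_mb, static_ram_each)
```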
If I submit several thousand regular test (sleep 600) jobs, they fill up
the nodes and partition the 16-core slots into 16 dynamic slots, as
expected. However, if I then submit, as the same user, 20 test jobs with
request_cpus=8, they never start running until the large backlog of
single-core jobs is completely finished (CLAIM_WORKLIFE is 1hr).
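Here is a toy model of the starvation (my own sketch, not HTCondor's actual matchmaking): each time a 1-core job finishes inside CLAIM_WORKLIFE, its claim is reused for the next queued 1-core job, so free cores on the partitionable slot never accumulate to 8:

```python
def starved_eight_core_starts(backlog: int, request: int = 8) -> int:
    """Count how many 8-core jobs could match while a 1-core backlog
    keeps reclaiming each freed core via claim reuse (toy model)."""
    free_cores = 0
    started = 0
    while backlog:
        free_cores += 1              # a 1-core job completes
        if free_cores >= request:    # enough free cores for a big job?
            started += 1
            free_cores -= request
        else:                        # claim reuse grabs the core first
            backlog -= 1
            free_cores -= 1
    return started

print(starved_eight_core_starts(5000))  # 0 -- the 8-core jobs never match
```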
Enter the defrag daemon. As I understand it, it was designed to prevent
this kind of starvation. I configured it as follows:
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 90
DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.0
DEFRAG_MAX_WHOLE_MACHINES = 4
DEFRAG_MAX_CONCURRENT_DRAINING = 4
on the central manager, so that it should, after a few minutes, start
draining one or two machines. Instead, after three hours all I see in
the logs is:
08/15/12 16:58:52 Doing nothing, because number to drain in next 90s is
calculated to be 0.
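Doing the arithmetic on my own settings (and assuming the daemon computes a per-interval quota and rounds down, which is only a guess about the 7.8.x behavior), 12 machines/hour over a 90-second interval works out to a fraction of a machine per cycle:

```python
drain_per_hour = 12.0   # DEFRAG_DRAINING_MACHINES_PER_HOUR
interval_s = 90         # DEFRAG_INTERVAL

per_interval = drain_per_hour * interval_s / 3600.0
print(per_interval)       # 0.3 machines per 90s cycle
print(int(per_interval))  # 0 -- if truncated, nothing ever drains
```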
I'm wondering whether (1) I have misunderstood the purpose of the defrag
daemon, (2) I have misconfigured it, or (3) there is something wrong in
its behavior.
Does anyone with experience setting up and running this have any
pointers or feedback?
Thanks a lot,
-Will