Hi HTCondor Experts,
Recently our worker machines have been crashing because of OOM
(out of memory). Each worker in our cluster has 128GB of memory,
of which 3072MB is reserved and cannot be used by HTCondor:
RESERVED_MEMORY = 3072
In addition, each worker
has 1 partitionable slot defined as below:
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
However, if you add up the slot sizes shown below in the
second-to-last column (MB), counting the memory left in the
partitionable slot plus all the dynamic slots, you get 128,723MB.
HTCondor apparently does not subtract the 3072MB (RESERVED_MEMORY)
from the physical memory of the machine.
master1:4} condor_status | grep worker1
Slot1@worker1    LINUX X86_64 Unclaimed Idle 1.000   339 0+03:18:58
slot1_1@worker1  LINUX X86_64 Claimed   Busy 1.020 15360 0+00:00:03
slot1_2@worker1  LINUX X86_64 Claimed   Busy 1.000 15104 0+00:07:57
slot1_3@worker1  LINUX X86_64 Claimed   Busy 1.000  4096 0+00:02:20
slot1_4@worker1  LINUX X86_64 Claimed   Busy 0.650  5888 0+00:08:08
slot1_5@worker1  LINUX X86_64 Claimed   Busy 1.020  8064 0+00:00:32
slot1_6@worker1  LINUX X86_64 Claimed   Busy 1.000 20096 0+00:00:03
slot1_7@worker1  LINUX X86_64 Claimed   Busy 1.010  5888 0+00:00:32
slot1_8@worker1  LINUX X86_64 Claimed   Busy 0.000 20096 0+00:36:00
slot1_9@worker1  LINUX X86_64 Claimed   Busy 1.000  4096 0+00:00:06
slot1_10@worker1 LINUX X86_64 Claimed   Busy 1.040  4096 0+00:00:03
slot1_11@worker1 LINUX X86_64 Claimed   Busy 1.010  5120 0+00:00:03
slot1_12@worker1 LINUX X86_64 Claimed   Busy 1.000 15360 0+00:02:08
slot1_13@worker1 LINUX X86_64 Claimed   Busy 1.000  5120 0+00:04:14
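For anyone who wants to check the arithmetic, here is a small sketch that sums the memory column from the output above. The variable names are mine, and the "expected" figure rests on an assumption: that "128GB" means exactly 128 * 1024 MB. The amount of memory HTCondor actually detects on the machine may differ from that.

```python
# Memory column (MB) copied from the condor_status output above:
# the partitionable slot first, then dynamic slots slot1_1 .. slot1_13.
slot_memory_mb = [
    339,
    15360, 15104, 4096, 5888, 8064, 20096, 5888,
    20096, 4096, 4096, 5120, 15360, 5120,
]

RESERVED_MEMORY_MB = 3072          # value of RESERVED_MEMORY in our config
ASSUMED_PHYSICAL_MB = 128 * 1024   # assumption: "128GB" == 131072 MB

total_mb = sum(slot_memory_mb)
expected_mb = ASSUMED_PHYSICAL_MB - RESERVED_MEMORY_MB

print("sum of slot memory:", total_mb)       # 128723
print("expected if reserved:", expected_mb)  # 128000
print("difference:", total_mb - expected_mb) # 723
```

Under that assumption, the slots add up to 723MB more than the memory that should be left after the reservation.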
My question is why RESERVED_MEMORY is not considered by
HTCondor in this case.
Thank you in advance,
Jewel