[HTCondor-users] jobs still landing on the machine even after condor_drain
- Date: Wed, 13 Aug 2014 10:23:43 +0100
- From: qing <gang.qin@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] jobs still landing on the machine even after condor_drain
Dear Expert:
In our cluster we are running Condor 8.2.1, and I use partitionable
slots on the execute machines.
## Partitionable slots
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
# Consumption policy
CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_CPUS = TARGET.RequestCpus
SLOT_TYPE_1_CONSUMPTION_MEMORY = TARGET.RequestMemory
SLOT_TYPE_1_CONSUMPTION_DISK = TARGET.RequestDisk
SLOT_WEIGHT = Cpus
USE_PID_NAMESPACES = False
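(For reference, the values can be double-checked against the running startd with condor_config_val, e.g. as below; the hostname in -name is just how the node is known in our pool, so adjust as needed:
svr019:~# condor_config_val -name node002.beowulf.cluster -startd CONSUMPTION_POLICY
svr019:~# condor_config_val -name node002.beowulf.cluster -startd SLOT_TYPE_1_CONSUMPTION_POLICY
)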
Yesterday we planned to reboot node002 in the condor cluster, so in the
afternoon I used 'condor_drain node002.beowulf.cluster' to prevent new
jobs from being sent to node002. I then saw the following change in
condor_status:
slot1@xxxxxxxxxxxx LINUX X86_64 Drained Retiring 0.020 27750 0+00:00:04
slot1_1@xxxxxxxxxx LINUX X86_64 Claimed Retiring 8.000 2000 0+00:00:04
slot1_2@xxxxxxxxxx LINUX X86_64 Claimed Retiring 8.000 2000 0+00:00:04
slot1_3@xxxxxxxxxx LINUX X86_64 Claimed Retiring 7.990 2000 0+00:00:04
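In case it helps, the per-slot State/Activity above can also be pulled out directly with a constraint (the hostname is just how node002 is known to the collector):
svr019:~# condor_status -constraint 'Machine == "node002.beowulf.cluster"' -format "%s " Name -format "%s " State -format "%s\n" Activity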
node002 is a 24-core machine and each job running there is an 8-core job.
After some time the condor_status output changed to:
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.020 27750 0+00:02:56
slot1_1@xxxxxxxxxx LINUX X86_64 Claimed Busy 7.890 2000 0+00:02:57
slot1_2@xxxxxxxxxx LINUX X86_64 Claimed Busy 7.810 2000 0+00:02:57
slot1_3@xxxxxxxxxx LINUX X86_64 Claimed Busy 8.240 2000 0+00:02:57
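At that point the drain-related attributes in the slot ad could also be dumped, roughly like this (I'm just grepping, since I don't have the exact attribute names, e.g. Draining / DrainingRequestId, at hand):
svr019:~# condor_status -long -constraint 'Machine == "node002.beowulf.cluster"' | grep -i drain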
During that time period the same 3 jobs were still running on node002,
with the following IDs:
svr019:~# condor_q -run | grep node002
53772.0 prdatlas089 8/12 04:02 0+04:22:08 slot1@xxxxxxxxxxxxxxxxxxxxxxx
54203.0 prdatlas047 8/12 11:00 0+04:12:48 slot1@xxxxxxxxxxxxxxxxxxxxxxx
54204.0 prdatlas047 8/12 11:00 0+04:12:03 slot1@xxxxxxxxxxxxxxxxxxxxxxx
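(Side note: the same list can presumably be obtained with a constraint on RemoteHost instead of grep, something like:
svr019:~# condor_q -run -constraint 'regexp("node002", RemoteHost)'
)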
I assumed that once these 3 jobs finished, no more new jobs would be
sent to node002, but this morning I found new jobs still landing on
node002:
svr019:~# condor_q -run | grep node002
54612.0 prdatlas089 8/12 18:55 0+01:59:12 slot1@xxxxxxxxxxxxxxxxxxxxxxx
54700.0 prdatlas089 8/12 20:38 0+01:57:21 slot1@xxxxxxxxxxxxxxxxxxxxxxx
54709.0 prdatlas089 8/12 20:43 0+00:10:24 slot1@xxxxxxxxxxxxxxxxxxxxxxx
So it looks to me as if condor_drain did take some drain action on the
execute machine, but later the machine was automatically brought back
online. I'm not sure whether this is related to the partitionable slot
settings or to something else. Any idea?
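In the meantime I guess the next thing to look at is the StartLog on node002 for drain-related messages, to see when and why the drain state went away; assuming the default log location (/var/log/condor on our nodes) and not knowing the exact wording to look for, a crude grep would be:
svr019:~# grep -i drain /var/log/condor/StartLog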
Cheers,
Gang