Dear Expert:
In our cluster we are running Condor 8.2.1, and I use partitionable
slots on the execute machines. The relevant configuration is:
## Partitionable slots
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
# Consumption policy
CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_CPUS = TARGET.RequestCpus
SLOT_TYPE_1_CONSUMPTION_MEMORY = TARGET.RequestMemory
SLOT_TYPE_1_CONSUMPTION_DISK = TARGET.RequestDisk
SLOT_WEIGHT = Cpus
USE_PID_NAMESPACES = False
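In case it matters, here is roughly how I would check that the
partitionable slot and consumption policy settings are actually in
effect on the startd. The exact ClassAd attribute names in the grep
are my assumption from the manual, not something I have verified:

# Dump the partitionable slot's ClassAd and look for the
# partitionable/consumption-related attributes (names assumed):
condor_status -l slot1@node002.beowulf.cluster | egrep -i 'PartitionableSlot|Consumption|SlotWeight'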
Yesterday we planned to reboot node002 in the Condor cluster, so in the
afternoon I ran 'condor_drain node002.beowulf.cluster' to prevent
new jobs from being sent to node002, and then I saw the following change
in condor_status:
slot1@xxxxxxxxxxxx   LINUX  X86_64  Drained  Retiring  0.020  27750  0+00:00:04
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  7.990   2000  0+00:00:04
node002 is a 24-core machine and each job running there is an 8-core
job. After some time the status changed to:
slot1@xxxxxxxxxxxx   LINUX  X86_64  Unclaimed  Idle  0.020  27750  0+00:02:56
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.890   2000  0+00:02:57
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.810   2000  0+00:02:57
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  8.240   2000  0+00:02:57
During this period the same 3 jobs kept running on node002; their IDs
are as follows:
svr019:~# condor_q -run | grep node002
53772.0   prdatlas089   8/12 04:02   0+04:22:08  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54203.0   prdatlas047   8/12 11:00   0+04:12:48  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54204.0   prdatlas047   8/12 11:00   0+04:12:03  slot1@xxxxxxxxxxxxxxxxxxxxxxx
I assumed that once these 3 jobs finished, no new jobs would be sent
to node002, but this morning I found that new jobs were still landing
on node002:
svr019:~# condor_q -run | grep node002
54612.0   prdatlas089   8/12 18:55   0+01:59:12  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54700.0   prdatlas089   8/12 20:38   0+01:57:21  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54709.0   prdatlas089   8/12 20:43   0+00:10:24  slot1@xxxxxxxxxxxxxxxxxxxxxxx
So it looks to me like condor_drain did take some drain action on the
execute machine, but later the machine was automatically brought back
online. I am not sure whether this is related to the partitionable slot
settings or to something else. Any ideas?
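If condor_drain is not meant to keep the node drained here, my fallback
plan would be to switch off START on the node by hand, roughly as below.
This assumes remote runtime configuration is enabled and authorized for
the startd (ENABLE_RUNTIME_CONFIG plus the matching SETTABLE_ATTRS
permission); otherwise I would put the same START expression into the
node's local config file and run condor_reconfig:

# Tell node002's startd to refuse new jobs, then reconfigure it
# (requires runtime configuration to be enabled and authorized):
condor_config_val -name node002.beowulf.cluster -startd -rset "START = False"
condor_reconfig node002.beowulf.cluster

Does that make sense, or should condor_drain itself keep node002
offline until the reboot?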
Cheers,
Gang