Dear Expert:
In our cluster we are running Condor 8.2.1, and I use partitionable
slots on the execute machines. The relevant configuration is:
## Partitionable slots
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
# Consumption policy
CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_POLICY = True
SLOT_TYPE_1_CONSUMPTION_CPUS = TARGET.RequestCpus
SLOT_TYPE_1_CONSUMPTION_MEMORY = TARGET.RequestMemory
SLOT_TYPE_1_CONSUMPTION_DISK = TARGET.RequestDisk
SLOT_WEIGHT = Cpus
USE_PID_NAMESPACES = False
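In case it matters, here is roughly how I would check that the
partitionable slot and consumption policy settings are actually in
effect on the startd. The exact ClassAd attribute names in the grep
are my assumption from the manual, not something I have verified:

# Dump the partitionable slot's ClassAd and look for the
# partitionable/consumption-related attributes (names assumed):
condor_status -l slot1@node002.beowulf.cluster | egrep -i 'PartitionableSlot|Consumption|SlotWeight'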
Yesterday we planned to reboot node002 in the Condor cluster, so in the
afternoon I ran 'condor_drain node002.beowulf.cluster' to prevent
new jobs from being sent to node002, and then I saw the following change
in condor_status:
slot1@xxxxxxxxxxxx   LINUX  X86_64  Drained  Retiring  0.020  27750  0+00:00:04
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  8.000   2000  0+00:00:04
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed  Retiring  7.990   2000  0+00:00:04
node002 is a 24-core machine and each job running there is an 8-core
job. After some time the status changed to:
slot1@xxxxxxxxxxxx   LINUX  X86_64  Unclaimed  Idle  0.020  27750  0+00:02:56
slot1_1@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.890   2000  0+00:02:57
slot1_2@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  7.810   2000  0+00:02:57
slot1_3@xxxxxxxxxx   LINUX  X86_64  Claimed    Busy  8.240   2000  0+00:02:57
During this period the same 3 jobs kept running on node002; their IDs
are as follows:
svr019:~# condor_q -run | grep node002
53772.0   prdatlas089   8/12 04:02   0+04:22:08  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54203.0   prdatlas047   8/12 11:00   0+04:12:48  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54204.0   prdatlas047   8/12 11:00   0+04:12:03  slot1@xxxxxxxxxxxxxxxxxxxxxxx
I assumed that once these 3 jobs finished, no new jobs would be sent
to node002, but this morning I found that new jobs were still landing
on node002:
svr019:~# condor_q -run | grep node002
54612.0   prdatlas089   8/12 18:55   0+01:59:12  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54700.0   prdatlas089   8/12 20:38   0+01:57:21  slot1@xxxxxxxxxxxxxxxxxxxxxxx
54709.0   prdatlas089   8/12 20:43   0+00:10:24  slot1@xxxxxxxxxxxxxxxxxxxxxxx
So it looks to me like condor_drain did take some drain action on the
execute machine, but later the machine was automatically brought back
online. I am not sure whether this is related to the partitionable slot
settings or to something else. Any ideas?
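If condor_drain is not meant to keep the node drained here, my fallback
plan would be to switch off START on the node by hand, roughly as below.
This assumes remote runtime configuration is enabled and authorized for
the startd (ENABLE_RUNTIME_CONFIG plus the matching SETTABLE_ATTRS
permission); otherwise I would put the same START expression into the
node's local config file and run condor_reconfig:

# Tell node002's startd to refuse new jobs, then reconfigure it
# (requires runtime configuration to be enabled and authorized):
condor_config_val -name node002.beowulf.cluster -startd -rset "START = False"
condor_reconfig node002.beowulf.cluster

Does that make sense, or should condor_drain itself keep node002
offline until the reboot?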
Cheers,
Gang