[HTCondor-users] CPU accounting: NonCondorLoadAvg
- Date: Mon, 03 Jun 2013 15:31:27 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] CPU accounting: NonCondorLoadAvg
This is something I have noticed for a while, but I would now like to investigate it further.
I often see machines stop accepting jobs, because they have gone into
"Owner" state. As far as I know, there's nothing in particular going on
with these machines, but I do intentionally have
BackgroundLoad = 0.5
START = $(CPUIdle) || (State != "Unclaimed" && State != "Owner")
so that when other (non-condor) work is done on these machines, condor
doesn't pick up any more resources. What's suspicious is that in this
state, the load average shown by condor_status is exactly 1.000:
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  21103  0+04:08:08
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  18847  0+04:07:29
I would have expected this to be a variable figure, and for the machine to start
accepting jobs again once it falls back below 0.5.
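If I have read the expression correctly, that is why it stays stuck. Here is a
minimal sketch of my reading (my own helper name startExpr, not Condor code; the
expansion is the one condor_status -long prints further down in this mail),
plugged with the values shown above:

#include <iostream>
#include <string>

// startExpr is a hypothetical stand-in for the expanded Start expression that
// "condor_status -long" shows below:
//   ( (LoadAvg - CondorLoadAvg) <= 0.5 ) || ( State != "Unclaimed" && State != "Owner" )
static bool startExpr(double loadAvg, double condorLoadAvg, const std::string& state)
{
    bool cpuIdle = (loadAvg - condorLoadAvg) <= 0.5;                 // BackgroundLoad = 0.5
    bool alreadyClaimed = (state != "Unclaimed" && state != "Owner");
    return cpuIdle || alreadyClaimed;
}

int main()
{
    std::cout << std::boolalpha;
    // Values condor_status reports for the partitionable slot1:
    std::cout << startExpr(1.0, 0.0, "Owner") << "\n";   // false -> no new matches
    // What I would expect once the non-condor load drops back below 0.5:
    std::cout << startExpr(0.3, 0.0, "Owner") << "\n";   // true  -> jobs accepted again
    return 0;
}

So as long as LoadAvg stays pinned at 1.000 with CondorLoadAvg at 0.0, the first
clause can never become true.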
I am using dynamic slots:
COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=100%, ram=75%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
Complete output from "condor_status" is at the end of this mail. There
are definitely resources available to run more jobs, but I believe it's
the LoadAv 1.000 which keeps the machine in "Owner" state, and therefore
much less work is being done than might otherwise happen.
I'm not sure how to investigate this further. Somehow condor is
separating out the components of load average which derive from condor
and non-condor processes:
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
I think that CondorLoadAvg probably comes from Resource::compute_condor_load
in src/condor_startd.V6/Resource.cpp, but I got lost trying to work out
what it does. Just above this in the file there is:
m_load = sysapi_load_avg();                              // raw system load average
...
m_condor_load = resmgr->sum( &Resource::condor_load );   // load attributed to condor, summed over slots
if( m_condor_load > m_load ) {
    m_condor_load = m_load;                              // clamp: condor load can never exceed the total
}
and it looks like sysapi_load_avg reads /proc/loadavg (on Linux).
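Assuming that really is all it does, the raw figure would come from something
like this sketch (again my own code, not Condor's), which just parses the
1-minute field of /proc/loadavg:

#include <cstdio>

// Sketch only: read the 1-minute load average from /proc/loadavg, which I
// assume is roughly what sysapi_load_avg() returns on Linux.
int main()
{
    double one_min = 0.0;
    if (FILE* f = std::fopen("/proc/loadavg", "r")) {
        if (std::fscanf(f, "%lf", &one_min) != 1) {
            one_min = 0.0;                      // parse failure: fall back to 0
        }
        std::fclose(f);
    }
    std::printf("1-minute load average: %.2f\n", one_min);
    return 0;
}

On these machines that raw figure is clearly well above 1.0 (see TotalLoadAvg
below), so the puzzle is how the slot's LoadAvg ends up at exactly 1.000.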
Now, one thing I should point out is that the jobs I'm running are I/O-heavy,
so the kernel load average (which on Linux also counts processes blocked in
uninterruptible I/O wait) may be high while CPU usage is relatively low, and
may not be directly comparable with a figure based on CPU usage alone. Also, I
am running vanilla universe jobs which spawn several processes linked by pipes,
so each job is not a single process. But I still don't see where the exact
1.000 figure comes from.
The platform is Ubuntu 12.04 x86_64 running Condor 7.8.8-110288 (from the
Debian package).
Any suggestions for where to look further?
Thanks,
Brian Candler.
$ condor_status
Name               OpSys  Arch    State    Activity  LoadAv    Mem  ActvtyTime

slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000     47  0+00:08:41
slot1_10@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.630   1504  0+00:08:51
slot1_11@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.600   1504  0+00:08:51
slot1_12@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.630   1504  0+00:08:51
slot1_13@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.400   1504  0+00:08:51
slot1_14@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.780   1504  0+00:08:51
slot1_15@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.780   1504  0+00:08:51
slot1_16@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.770   1504  0+00:08:51
slot1_1@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_2@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_3@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_4@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_5@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_6@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_7@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.590   1504  0+00:08:52
slot1_8@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.600   1504  0+00:08:51
slot1_9@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.620   1504  0+00:08:51
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  18847  0+04:27:32
slot1_1@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.220    376  0+04:27:42
slot1_2@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.650    376  0+04:27:42
slot1_4@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:23:23
slot1_7@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:23:01
slot1_8@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:24:36

                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       X86_64/LINUX    23     2      21         0       0          0        0

              Total    23     2      21         0       0          0        0
$ condor_status slot1@xxxxxxxxxxxxxxxxxxxx -long | grep -i loadavg
TotalCondorLoadAvg = 10.610000
TotalLoadAvg = 15.890000
LoadAvg = 1.000000
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
CondorLoadAvg = 0.0
Start = ( ( LoadAvg - CondorLoadAvg ) <= 0.500000 ) || ( State != "Unclaimed" && State != "Owner" )
$ condor_status slot1@xxxxxxxxxxxxxxxxxxxx -long | grep -i loadavg
TotalCondorLoadAvg = 1.140000
TotalLoadAvg = 5.480000
LoadAvg = 1.000000
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
CondorLoadAvg = 0.0
Start = ( ( LoadAvg - CondorLoadAvg ) <= 0.500000 ) || ( State != "Unclaimed" && State != "Owner" )
$ ssh dar4 uptime
15:09:42 up 19 days, 20:45, 1 user, load average: 4.51, 5.04, 6.21
So that part makes sense: dar4 is running 5 jobs, so I'd expect the load
average to be around 5. But in that case, (a) why is TotalCondorLoadAvg so much
lower than TotalLoadAvg (5.48 minus 1.14 leaves about 4.34 of apparently
non-condor load, even though as far as I know nothing else is running there),
and (b) why is LoadAvg exactly 1.000000?