[HTCondor-users] CPU accounting: NonCondorLoadAvg
- Date: Mon, 03 Jun 2013 15:31:27 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] CPU accounting: NonCondorLoadAvg
This is something I have noticed for a while, but I would now like to investigate it further.
I often see machines stop accepting jobs, because they have gone into
"Owner" state. As far as I know, there's nothing in particular going on
with these machines, but I do intentionally have
BackgroundLoad = 0.5
START = $(CPUIdle) || (State != "Unclaimed" && State != "Owner")
so that when other (non-condor) work is done on these machines, condor
doesn't pick up any more resources. What's suspicious is that in this
state, the load average shown by condor_status is exactly 1.000:
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  21103  0+04:08:08
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  18847  0+04:07:29
I would have expected this to be a variable figure, and for the machine to start
accepting jobs again once it falls back below 0.5.
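If I have read the expression correctly, that is why it stays stuck. Here is a
minimal sketch of my reading (my own helper name startExpr, not Condor code; the
expansion is the one condor_status -long prints further down in this mail),
plugged with the values shown above:

#include <iostream>
#include <string>

// startExpr is a hypothetical stand-in for the expanded Start expression that
// "condor_status -long" shows below:
//   ( (LoadAvg - CondorLoadAvg) <= 0.5 ) || ( State != "Unclaimed" && State != "Owner" )
static bool startExpr(double loadAvg, double condorLoadAvg, const std::string& state)
{
    bool cpuIdle = (loadAvg - condorLoadAvg) <= 0.5;                 // BackgroundLoad = 0.5
    bool alreadyClaimed = (state != "Unclaimed" && state != "Owner");
    return cpuIdle || alreadyClaimed;
}

int main()
{
    std::cout << std::boolalpha;
    // Values condor_status reports for the partitionable slot1:
    std::cout << startExpr(1.0, 0.0, "Owner") << "\n";   // false -> no new matches
    // What I would expect once the non-condor load drops back below 0.5:
    std::cout << startExpr(0.3, 0.0, "Owner") << "\n";   // true  -> jobs accepted again
    return 0;
}

So as long as LoadAvg stays pinned at 1.000 with CondorLoadAvg at 0.0, the first
clause can never become true.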
I am using dynamic slots:
COUNT_HYPERTHREAD_CPUS = True
SLOT_TYPE_1 = cpus=100%, ram=75%, swap=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
Complete output from "condor_status" is at the end of this mail. There
are definitely resources available to run more jobs, but I believe it's
the LoadAv 1.000 which keeps the machine in "Owner" state, and therefore
much less work is being done than might otherwise happen.
I'm not sure how to investigate this further. Somehow condor is
separating out the components of load average which derive from condor
and non-condor processes:
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
I think that CondorLoadAvg probably comes from Resource::compute_condor_load
in src/condor_startd.V6/Resource.cpp, but I got lost trying to work out
what it does. Just above this in the file there is:
m_load = sysapi_load_avg();                              // raw system load average
...
m_condor_load = resmgr->sum( &Resource::condor_load );   // load attributed to condor, summed over slots
if( m_condor_load > m_load ) {
    m_condor_load = m_load;                              // clamp: condor load can never exceed the total
}
and it looks like sysapi_load_avg reads /proc/loadavg (on Linux).
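Assuming that really is all it does, the raw figure would come from something
like this sketch (again my own code, not Condor's), which just parses the
1-minute field of /proc/loadavg:

#include <cstdio>

// Sketch only: read the 1-minute load average from /proc/loadavg, which I
// assume is roughly what sysapi_load_avg() returns on Linux.
int main()
{
    double one_min = 0.0;
    if (FILE* f = std::fopen("/proc/loadavg", "r")) {
        if (std::fscanf(f, "%lf", &one_min) != 1) {
            one_min = 0.0;                      // parse failure: fall back to 0
        }
        std::fclose(f);
    }
    std::printf("1-minute load average: %.2f\n", one_min);
    return 0;
}

On these machines that raw figure is clearly well above 1.0 (see TotalLoadAvg
below), so the puzzle is how the slot's LoadAvg ends up at exactly 1.000.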
Now, one thing I should point out is that the jobs I'm running are I/O-heavy,
so the kernel load average (which on Linux also counts processes blocked in
uninterruptible I/O wait) may be high while CPU usage is relatively low, and
may not be directly comparable with a figure based on CPU usage alone. Also, I
am running vanilla universe jobs which spawn several processes linked by pipes,
so each job is not a single process. But I still don't see where the exact
1.000 figure comes from.
The platform is Ubuntu 12.04 x86_64 running Condor 7.8.8-110288 (from the
Debian package).
Any suggestions for where to look further?
Thanks,
Brian Candler.
$ condor_status
Name               OpSys  Arch    State    Activity  LoadAv    Mem  ActvtyTime

slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000     47  0+00:08:41
slot1_10@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.630   1504  0+00:08:51
slot1_11@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.600   1504  0+00:08:51
slot1_12@xxxxxxxxx LINUX  X86_64  Claimed  Busy       0.630   1504  0+00:08:51
slot1_13@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.400   1504  0+00:08:51
slot1_14@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.780   1504  0+00:08:51
slot1_15@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.780   1504  0+00:08:51
slot1_16@xxxxxxxxx LINUX  X86_64  Claimed  Busy       1.770   1504  0+00:08:51
slot1_1@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_2@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_3@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_4@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_5@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.670   1504  0+00:08:52
slot1_6@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.680   1504  0+00:08:52
slot1_7@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.590   1504  0+00:08:52
slot1_8@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.600   1504  0+00:08:51
slot1_9@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.620   1504  0+00:08:51
slot1@xxxxxxxxxxxx LINUX  X86_64  Owner    Idle       1.000  18847  0+04:27:32
slot1_1@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.220    376  0+04:27:42
slot1_2@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       0.650    376  0+04:27:42
slot1_4@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:23:23
slot1_7@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:23:01
slot1_8@xxxxxxxxxx LINUX  X86_64  Claimed  Busy       1.220   1504  0+00:24:36

                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       X86_64/LINUX    23     2      21         0       0          0        0

              Total    23     2      21         0       0          0        0
$ condor_status slot1@xxxxxxxxxxxxxxxxxxxx -long | grep -i loadavg
TotalCondorLoadAvg = 10.610000
TotalLoadAvg = 15.890000
LoadAvg = 1.000000
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
CondorLoadAvg = 0.0
Start = ( ( LoadAvg - CondorLoadAvg ) <= 0.500000 ) || ( State != "Unclaimed" && State != "Owner" )
$ condor_status slot1@xxxxxxxxxxxxxxxxxxxx -long | grep -i loadavg
TotalCondorLoadAvg = 1.140000
TotalLoadAvg = 5.480000
LoadAvg = 1.000000
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
CondorLoadAvg = 0.0
Start = ( ( LoadAvg - CondorLoadAvg ) <= 0.500000 ) || ( State != "Unclaimed" && State != "Owner" )
$ ssh dar4 uptime
15:09:42 up 19 days, 20:45, 1 user, load average: 4.51, 5.04, 6.21
So that part makes sense: dar4 is running 5 jobs, so I'd expect the load
average to be around 5. But in that case, (a) why is TotalCondorLoadAvg so much
lower than TotalLoadAvg (5.48 minus 1.14 leaves about 4.34 of apparently
non-condor load, even though as far as I know nothing else is running there),
and (b) why is LoadAvg exactly 1.000000?