
Re: [HTCondor-users] Load Balancer for AP(submitter)



The schedd ClassAd has many statistics counters. Some of them always count up, and some report the current number of things in the Schedd.

The number of factory jobs that have not yet been materialized is reported in the JobsUnmaterialized counter.

The number of jobs that have materialized but have not left the queue is reported as TotalJobAds. There are
also a number of subtotals for jobs in various states:

TotalHeldJobs = 21763
TotalIdleJobs = 65405
TotalLocalJobsIdle = 1
TotalLocalJobsRunning = 0
TotalRemovedJobs = 0
TotalRunningJobs = 18008
TotalSchedulerJobsIdle = 0
TotalSchedulerJobsRunning = 8
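
For the load balancer, the two counters above can be added together to get a per-AP job count that includes factory jobs. Here is a minimal sketch in Python, assuming the values were fetched with something like `condor_status -schedd -af Name TotalJobAds JobsUnmaterialized RecentDaemonCoreDutyCycle` and that each output line has exactly those four fields. The helper names and the 0.95 duty-cycle cutoff are made up for illustration and are not part of HTCondor.

```python
def parse_af_line(line):
    """Turn one whitespace-separated output line into a dict.

    Assumed field order: Name TotalJobAds JobsUnmaterialized
    RecentDaemonCoreDutyCycle.
    """
    name, total, unmaterialized, duty = line.split()

    def number(text, cast):
        # Treat an attribute that prints as "undefined" as zero.
        return cast(0) if text == "undefined" else cast(text)

    return {
        "Name": name,
        "TotalJobAds": number(total, int),
        "JobsUnmaterialized": number(unmaterialized, int),
        "RecentDaemonCoreDutyCycle": number(duty, float),
    }


def effective_jobs(ad):
    """Materialized jobs in the queue plus factory jobs not yet materialized."""
    return ad["TotalJobAds"] + ad["JobsUnmaterialized"]


def least_loaded(ads, max_duty=0.95):
    """Pick the AP with the fewest effective jobs.

    APs whose duty cycle is above max_duty (an arbitrary, hypothetical
    cutoff) are skipped. Returns None if every AP is skipped.
    """
    candidates = [ad for ad in ads if ad["RecentDaemonCoreDutyCycle"] <= max_duty]
    if not candidates:
        return None
    return min(candidates, key=effective_jobs)
```

Calling `least_loaded([parse_af_line(l) for l in output.splitlines()])` on the command output would then return the dict for the AP to submit to, or None if all of them look too busy.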

As for a high DaemonCoreDutyCycle, there are many possible causes, and it is not always easy to determine which one is the root cause. Very often the real reason for a slowdown is 

I would recommend running the tool

   condor_top 

It shows where the Schedd is spending time. The tool is written in Python, so you can look at the code to see how it gets the data, which is basically the Python equivalent of running

   condor_status -direct -schedd

It then finds the attributes in the schedd ad that show time spent in various timers and functions in the Schedd and prints them out sorted by time.
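
The "find the timing attributes and sort them" step can be sketched roughly as follows. This is not the actual condor_top code. It assumes the schedd ad has already been converted to a plain dict, and that the timing attributes of interest have names ending in `Runtime`. That suffix and the attribute names in the usage example are assumptions for illustration, so check the real ad for the names your version publishes.

```python
def top_runtimes(ad, suffix="Runtime", limit=10):
    """Return up to `limit` (name, value) pairs, largest value first.

    `ad` is a plain mapping of attribute name to value. Only numeric
    attributes whose name ends with `suffix` are considered.
    """
    timings = [
        (name, value)
        for name, value in ad.items()
        if name.endswith(suffix)
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
    ]
    # Sort by time spent, descending, so the most expensive item comes first.
    timings.sort(key=lambda pair: pair[1], reverse=True)
    return timings[:limit]


if __name__ == "__main__":
    # Hypothetical attribute names, for illustration only.
    sample = {
        "DCTimerExampleRuntime": 12.5,
        "DCCommandExampleRuntime": 40.0,
        "TotalJobAds": 5,
    }
    for name, seconds in top_runtimes(sample):
        print(f"{seconds:10.2f}  {name}")
```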

-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ram Ban <ramban046@xxxxxxxxx>
Sent: Tuesday, March 17, 2026 11:34 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Load Balancer for AP(submitter)

Hi all,

I have written a load balancer for multiple APs so that I can choose the least loaded AP and distribute load evenly.
I take the total job count and RecentDaemonCoreDutyCycle into consideration, and I get this information with the condor_status -schedd command.

There are two major problems I am facing:

1. I am using jobs with max_idle 300 and a maximum of 2000 jobs per cluster, but the jobs still in the factory are not visible in condor_status, so I end up oversubscribing an AP, which leads to slowness. Is there any way to get the total number of jobs (including those in the factory) without condor_q? It hangs a lot, and I am removing the dependency on it to reduce load.

2. If an AP has a high RecentDaemonCoreDutyCycle, I am not able to debug the reason. Why does this happen?

Are there any other factors I should be considering for the load balancer?

Thanks and Regards
Raman