Re: [HTCondor-users] Load Balancer for AP(submitter)

Date: Wed, 18 Mar 2026 05:32:28 +0000
From: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Load Balancer for AP(submitter)

The schedd classad has many statistics counters, some of which always count up, and some of which are counts of current things in the Schedd.

The number of factory jobs that have not yet been materialized is reported in the JobsUnmaterialized counter.

The number of jobs that have materialized but have not left the queue is reported as TotalJobAds. There are a number of subtotals for jobs in various states.

TotalHeldJobs = 21763

TotalIdleJobs = 65405

TotalLocalJobsIdle = 1

TotalLocalJobsRunning = 0

TotalRemovedJobs = 0

TotalRunningJobs = 18008

TotalSchedulerJobsIdle = 0

TotalSchedulerJobsRunning = 8
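A minimal sketch of how a load balancer might combine these counters, assuming each schedd ad has already been fetched into a plain dict (for example by parsing condor_status -schedd output). The function names, the 0.95 duty-cycle cutoff, and the sample numbers are illustrative assumptions, not part of HTCondor:

```python
from typing import Iterable, Mapping, Optional


def queued_load(ad: Mapping[str, object]) -> int:
    """Materialized jobs still in the queue plus factory jobs not yet materialized."""
    # Attributes missing from the ad are treated as 0.
    return int(ad.get("TotalJobAds", 0)) + int(ad.get("JobsUnmaterialized", 0))


def least_loaded(
    ads: Iterable[Mapping[str, object]],
    max_duty_cycle: float = 0.95,  # hypothetical cutoff, tune for your pool
) -> Optional[Mapping[str, object]]:
    """Return the ad with the smallest queued load among APs that are not saturated."""
    candidates = [
        ad for ad in ads
        if float(ad.get("RecentDaemonCoreDutyCycle", 0.0)) <= max_duty_cycle
    ]
    if not candidates:
        return None
    return min(candidates, key=queued_load)


if __name__ == "__main__":
    # Made-up sample ads for illustration only.
    sample_ads = [
        {"Name": "ap1", "TotalJobAds": 1200, "JobsUnmaterialized": 1700,
         "RecentDaemonCoreDutyCycle": 0.40},
        {"Name": "ap2", "TotalJobAds": 2000, "JobsUnmaterialized": 0,
         "RecentDaemonCoreDutyCycle": 0.55},
        {"Name": "ap3", "TotalJobAds": 100, "JobsUnmaterialized": 0,
         "RecentDaemonCoreDutyCycle": 0.99},
    ]
    best = least_loaded(sample_ads)
    print(best["Name"], queued_load(best))
```

Here ap1 looks lighter than ap2 if only TotalJobAds is considered, but it carries more work once its unmaterialized factory jobs are counted.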

As for a high DaemonCoreDutyCycle, there are many possible causes, and it is not always easy to determine the root cause. Very often the real reason for a slowdown is

I would recommend running the tool

condor_top

It shows where the Schedd is spending time. The tool is written in Python, so you can look at the code to see how it gets the data, which is basically the Python equivalent of running

condor_status -direct -schedd

then finding the attributes in the schedd ad that show time spent in various timers and functions in the Schedd, and printing them out sorted by time.
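The "find the timing attributes and sort by time" step could be sketched roughly as follows. This is not condor_top's actual code: it assumes the schedd ad is already available as a plain dict, and the "Runtime" suffix and the attribute names in the sample are hypothetical placeholders, so check a real ad for the names your version reports.

```python
from typing import List, Mapping, Tuple


def top_runtimes(
    ad: Mapping[str, object],
    suffix: str = "Runtime",  # assumed naming convention, verify against a real ad
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (attribute, value) pairs, largest value first."""
    rows = []
    for name, value in ad.items():
        if not name.endswith(suffix):
            continue
        # Skip booleans and non-numeric values such as strings.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        rows.append((name, float(value)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # Hypothetical attribute names and values, for illustration only.
    sample_ad = {
        "Name": "ap1",
        "ExampleTimerARuntime": 12.5,
        "ExampleTimerBRuntime": 310.0,
        "ExampleCommandRuntime": 48.25,
        "TotalIdleJobs": 65405,
    }
    for attr, seconds in top_runtimes(sample_ad):
        print(f"{seconds:10.2f}  {attr}")
```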

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ram Ban <ramban046@xxxxxxxxx>
Sent: Tuesday, March 17, 2026 11:34 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Load Balancer for AP(submitter)

Hi all,

I have written a load balancer for multiple APs so that I can choose the least loaded AP and distribute load evenly.

I am taking total jobs and RecentDaemonCoreDutyCycle into consideration. I get this information using the condor_status -schedd command.

There are 2 major problems I am facing:

1. I am using jobs with max_idle 300 and a maximum of 2000 jobs in a cluster, but the jobs still in the factory are not visible in condor_status, so I end up oversubscribing an AP, which leads to slowness. Is there any way to get the total job count (including jobs in the factory) without condor_q? It hangs a lot, and I am removing the dependency on it to reduce load.

2. If an AP has a high RecentDaemonCoreDutyCycle, I am not able to debug the reason. Why does this happen?

Are there any other factors I should be considering for the load balancer?

Thanks and Regards

Raman
