
Re: [HTCondor-users] Fwd: Tips for improving schedd performance with many jobs



Thanks for this, there is a lot of useful information here.
-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Luehring, Frederick C <luehring@xxxxxx>
Sent: Friday, October 17, 2025 2:08 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Smith, Thomas <tsmith@xxxxxxx>
Subject: [HTCondor-users] Fwd: Tips for improving schedd performance with many jobs
 
This message about how BNL gets huge numbers of jobs running on a single AP was apparently rejected because Tom Smith is not (yet) a member of the HTCondor email list. Anyway, Jiaqi might find the content of the email useful.

Fred

-------- Forwarded Message --------
Subject: Re: [HTCondor-users] Tips for improving schedd performance with many jobs
Date: Fri, 17 Oct 2025 16:13:20 +0000
From: Smith, Thomas <tsmith@xxxxxxx>
To: Luehring, Frederick C <luehring@xxxxxx>, htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>


Hi Fred and Jiaqi,

I think Fred hit on the major bottlenecks that I ran into when setting up our big APs (Access Points/schedds).

Since we are using VMs for our APs, in an effort not to be wasteful with resources I ended up undershooting the amount of resources needed in the first iteration.

I found that, as Fred mentioned, we needed to account for roughly 2MB of memory per running job to not be at risk of running out of memory. You can maybe get away with less, but this number has been pretty reliable for us. Right now one of our APs has 18k running jobs and is using about 38GB of memory total (including what the AP itself needs to run the OS and so on). These APs can scale reliably to 50k running jobs, and we've afforded them 100GB of memory and 10 CPUs.
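If you want to sanity-check that number on a live AP, one rough way (just a sketch; it assumes the schedd and its shadows are the main HTCondor memory consumers on the box) is to sum their resident set sizes:

    # total resident memory of the schedd plus all running shadows, in MB
    ps -C condor_schedd,condor_shadow -o rss= | awk '{s+=$1} END {printf "%d MB\n", s/1024}'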

The idle jobs don't seem to consume much memory at all, so our big APs have accommodated hundreds of thousands of idle jobs without any extra special considerations.

The next hard bottleneck we hit was Linux ephemeral ports. We realized that at about 29k running jobs we were hitting a wall even though memory and all of that were sufficient. You can check your port range with `cat /proc/sys/net/ipv4/ip_local_port_range`. We have ours increased to the range 10000-64999, which should allow us 55k running jobs.
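For reference, the range can be widened with sysctl; the values below are just the range we use, and the sysctl.d file name is arbitrary:

    # take effect immediately
    sysctl -w net.ipv4.ip_local_port_range="10000 64999"
    # persist across reboots
    echo "net.ipv4.ip_local_port_range = 10000 64999" > /etc/sysctl.d/90-ephemeral-ports.conf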

Though I have the schedd config capped artificially at 50k running jobs and 1M jobs per owner (this pool is fairly large and dedicated to a single experiment):
MAX_JOBS_RUNNING = 50000
MAX_JOBS_PER_OWNER = 1000000
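You can confirm what the AP actually picked up by running condor_config_val on it, which prints the values currently in effect:

    condor_config_val MAX_JOBS_RUNNING MAX_JOBS_PER_OWNER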

To scale further, since 50k running jobs is not enough to fill this pool, we have four APs of identical spec and configuration: three are enough to fill the pool completely, and the fourth is there for redundancy.

I think generally speaking the best way to scale is with more APs, not bigger APs. The only reason big APs were desired in our case was that a single user often fills this pool, and it was desirable for them to keep the number of APs to manage low. On one of our other pools, we have as many as 30 or 40 APs with slight variations to serve different experiments and types of users.

Something also potentially useful to check would be the RecentDaemonCoreDutyCycle of your AP. A command like `condor_status -schedd -af:t Name RecentDaemonCoreDutyCycle` will show you this value for all the APs in your pool. This value is between 0 and 1, and if it is very close to 1 for a long time, your AP is very busy and may need more CPU. A huge batch of jobs being submitted at once will push this close to 1, but that is very transient and not an issue; it is only problematic if it is sustained over long periods.
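If you would rather watch it over time than spot-check, a trivial shell loop around that same command does the job (just a sketch):

    # print the duty cycle of every AP in the pool once a minute
    while true; do
        condor_status -schedd -af:t Name RecentDaemonCoreDutyCycle
        sleep 60
    done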

One thing Christoph mentioned, I've also seen before: a user running condor_q repeatedly in a very tight loop was able to absolutely cripple our AP with the sheer number of requests.

Hope some of this helps
Cheers
Tom

From: Luehring, Frederick C <luehring@xxxxxx>
Sent: Friday, October 17, 2025 11:29 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>; Smith, Thomas <tsmith@xxxxxxx>
Subject: Re: [HTCondor-users] Tips for improving schedd performance with many jobs
 

Hi Jiaqi,

I'm adding Tom Smith from Brookhaven National Laboratory (BNL), who has, with a lot of work, gotten Access Points to run 50,000 or more jobs at a time. Tom should correct me, but I believe he found that he needed ~2 MB of memory per running job and far less memory for idle jobs. In any case, one thing he did for sure was add a large amount of memory to his Access Point servers. I put some of Tom's notes on running lots of jobs on an AP just below my signature.

@Tom, could you please briefly list the other things you did to support a large number of jobs on an Access Point?

Thanks!

Fred

Regarding schedds not running enough jobs, there are a few potential bottlenecks we’ve seen:
MAX_JOBS_RUNNING is a potential first hard limit (check this)

Potentially running out of system memory (plan for ~2MB per running job). For example, if your target is 15k running jobs on a schedd, plan for 30GB of RAM for the AP

On a higher scale (~29k or so running, depending on your Linux system defaults or config) you may hit default ephemeral port limits. You can check the port range with `cat /proc/sys/net/ipv4/ip_local_port_range`

I have also seen users accidentally set an artificial limit on the number of their own running jobs in their JDF (job description file). This was especially hard to find because it was transient and seemingly only affected one person; one example of how this can happen is sketched below.
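One example of how a submit file can impose such a limit on itself (not necessarily what happened in that case) is a late-materialization cap, which keeps only a fixed number of jobs materialized in the queue at a time and therefore also caps how many can run:

max_materialize = 100
queue 10000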


On 10/17/25 02:41, 潘家琦 wrote:
>
> Hi Christoph,
>
> Thanks so much for the helpful tips! You’ve given me several suggestions I hadn’t even thought about -- I'll definitely try them.
>
> Best,
> Jiaqi
> From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
> Date: Fri, Oct 17, 2025, 14:26
> Subject: Re: [HTCondor-users] Tips for improving schedd performance with many jobs
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Hi Jiaqi,
>
> it depends a little bit on your setup. One thing you might want to consider is putting the spool directory on a fast SSD, and of course the schedd needs sufficient RAM.
>
> The condor approach would also be to establish more schedulers, use late materialization and batches of jobs (queue 100 instead of 100 x queue 1), teach people to use condor_watch_q instead of 'watch condor_q', and limit the number of jobs per user in the queue.
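> As a rough sketch (file names made up), a batched submit with late materialization looks something like:
>
>     executable      = analyze.sh
>     arguments       = $(Process)
>     log             = jobs.log
>     # keep at most 500 jobs materialized in the queue at any one time
>     max_materialize = 500
>     queue 10000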
>
> Using decent hardware, fast SSDs and a fibrechannel-connected filesystem for job log & output writing, you can run a schedd with up to 100k jobs in different states without hassle, in my experience.
>
> Things immediately become a nuisance in the setup described if someone submits jobs with a typo in the log dir path etc., but that's just people, they break things ;)
>
> Best
> christoph
>
> --
> Christoph Beyer
> DESY Hamburg
> IT-Department
>
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
>
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
>
> ----------------------------------------------------------------------
> From: "潘家琦" <jiaqi.pan@xxxxxxxxxxx>
> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
> Sent: Friday, October 17, 2025 08:02:54
> Subject: [HTCondor-users] Tips for improving schedd performance with many jobs
>
> Hi all,
> We’re running HTCondor 24.0.5 with one dedicated submit node (Access Point).
> When the number of submitted jobs gets large — say over 20,000 — we notice that commands like condor_q become really slow, and sometimes even time out or fail.
> If we put some idle jobs on hold, things get much more responsive again.
> That helps temporarily, but we’d prefer not to intervene manually if possible.
> I also tried increasing the value of SCHEDD_QUERY_WORKERS, but it didn’t seem to make much difference.
> So I’m wondering if anyone has tuning tips or best practices for improving schedd performance when handling a large number of jobs.
> Are there specific configuration tweaks or limits we should look into?
> Thanks a lot for any suggestions!
> Best,
> Jiaqi
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
>
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/


--
Please change my address luehring@xxxxxxxxxxx to luehring@xxxxxx.
All indiana.edu addresses will stop working in 2025.

Frederick Luehring luehring@xxxxxx       +1 812 855 1025  IU
Indiana U. HEP     Fred.Luehring@xxxxxxx +41 22 767 11 66 CERN
