
Re: [HTCondor-users] Tips for improving schedd performance with many jobs



Hi Jiaqi,

I'm adding Tom Smith from Brookhaven National Laboratory (BNL), who has, with a lot of work, gotten Access Points up to 50,000 or more running jobs. Tom should correct me, but I believe he found that he needed ~2 MB of memory per running job and far less memory for idle jobs. In any case, one of the things he did was add a large amount of memory to his Access Point servers. I've put some of Tom's notes on running lots of jobs on an AP just below my signature.

@Tom, could you please briefly list the other things you did to support a large number of jobs on an Access Point?

Thanks!

Fred

Regarding schedds not running enough jobs, there are a few potential bottlenecks we've seen:
MAX_JOBS_RUNNING is a potential first hard limit (check this).
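A quick way to check it on the AP (the 20000 below is only an illustration):

    # show the limit the schedd is actually using
    condor_config_val MAX_JOBS_RUNNING

    # to raise it, set e.g. MAX_JOBS_RUNNING = 20000 in the AP's local config
    # and run condor_reconfig on the AP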

Potentially running out of system memory (plan for ~2 MB per running job). For example, if your target is 15k running jobs from a schedd, plan for 30 GB of RAM on the AP.
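A good chunk of that per-running-job memory is the condor_shadow processes the schedd spawns (typically one per running job); a rough way to spot-check the estimate on a live AP:

    # resident memory (kB) summed per daemon for the schedd and its shadows
    ps -C condor_schedd,condor_shadow -o rss=,comm= | \
        awk '{sum[$2]+=$1} END {for (d in sum) print d, sum[d], "kB"}'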

At higher scale (~29k or so running, depending on your Linux defaults or config) you may hit the default ephemeral port limit. You can check the port range with cat /proc/sys/net/ipv4/ip_local_port_range.
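To check and, if needed, widen the range (the values below are only an example; persist the setting via /etc/sysctl.d if it helps):

    # current range; the default on many distros is 32768 60999, i.e. ~28k ports
    cat /proc/sys/net/ipv4/ip_local_port_range

    # temporarily widen it (as root; pick a range that avoids ports your
    # services listen on)
    sysctl -w net.ipv4.ip_local_port_range="10000 64000"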

I have also seen users accidentally set an artificial limit on their own running jobs in their JDF. This was especially hard to find because it was transient and seemingly affected only one person.
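One way that can happen (a hypothetical submit-file fragment, not the actual case): a leftover concurrency_limits line quietly throttles how many of that user's jobs run at once:

    # hypothetical JDF fragment
    executable = analyze.sh
    # each job claims 100 units of the 'mytag' concurrency limit, so the
    # pool's configured limit for 'mytag' (or CONCURRENCY_LIMIT_DEFAULT)
    # caps how many run concurrently
    concurrency_limits = mytag:100
    queue 20000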


On 10/17/25 02:41, Jiaqi Pan wrote:
> 
> Hi Christoph,
> 
> Thanks so much for the helpful tips! You've given me several suggestions I hadn't even thought about -- I'll definitely try them.
> 
> Best,
> Jiaqi
> From: "Beyer, Christoph"<christoph.beyer@xxxxxxx <mailto:christoph.beyer@xxxxxxx>>
> Date: Fri, Oct 17, 2025, 14:26
> Subject: Re: [HTCondor-users] Tips for improving schedd performance with many jobs
> To: "HTCondor-Users Mail List"<htcondor-users@xxxxxxxxxxx <mailto:htcondor-users@xxxxxxxxxxx>>
> Hi Jiaqi,
> 
> It depends a little bit on your setup. One thing you might want to consider is putting the spool directory on a fast SSD, and of course the schedd needs sufficient RAM.
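> The relevant knob for that is SPOOL; a minimal sketch (the path is only an example):
> 
>     # in the AP's local config; point the schedd's job queue and spool at the SSD
>     # (move the existing spool contents over while HTCondor is stopped)
>     SPOOL = /ssd/condor/spool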
> 
> The Condor approach would also be to set up more schedds, use late materialization and batches of jobs (queue 100 instead of 100 x queue 1), teach people to use condor_watch_q instead of 'watch condor_q', and limit the number of jobs per user in the queue.
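> For the batching and late materialization, a sketch of a single submit file (names and numbers are illustrative):
> 
>     executable = run.sh
>     arguments  = $(ProcId)
>     # late materialization: keep at most ~1000 job objects in the queue at a time
>     max_materialize = 1000
>     queue 100000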
> 
> In my experience, using decent hardware, fast SSDs and a fibre-channel-connected filesystem for job log & output writing, you can run a schedd with up to 100k jobs in different states without hassle.
> 
> Things immediately become a nuisance in the setup described if someone submits jobs with a typo in the log dir path etc., but that's just people, they break things ;)
> 
> Best
> christoph
> 
> -- 
> Christoph Beyer
> DESY Hamburg
> IT-Department
> 
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
> 
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx <mailto:christoph.beyer@xxxxxxx>
> 
> ----------------------------------------------------------------------
> *From: *"Jiaqi Pan" <jiaqi.pan@xxxxxxxxxxx <mailto:jiaqi.pan@xxxxxxxxxxx>>
> *To: *"htcondor-users" <htcondor-users@xxxxxxxxxxx <mailto:htcondor-users@xxxxxxxxxxx>>
> *Sent: *Friday, 17 October 2025 08:02:54
> *Subject: *[HTCondor-users] Tips for improving schedd performance with many jobs
> 
> Hi all,
> We're running HTCondor 24.0.5 with one dedicated submit node (Access Point).
> When the number of submitted jobs gets large, say over 20,000, we notice that commands like condor_q become really slow, and sometimes even time out or fail.
> If we put some idle jobs on hold, things get much more responsive again.
> That helps temporarily, but we'd prefer not to intervene manually if possible.
> I also tried increasing the value of SCHEDD_QUERY_WORKERS, but it didn't seem to make much difference.
> So I'm wondering if anyone has tuning tips or best practices for improving schedd performance when handling a large number of jobs.
> Are there specific configuration tweaks or limits we should look into?
> Thanks a lot for any suggestions!
> Best,
> Jiaqi
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> 
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/


-- 
Please change my address luehring@xxxxxxxxxxx to luehring@xxxxxxx
All indiana.edu addresses will stop working in 2025.

Frederick Luehring luehring@xxxxxx       +1 812 855 1025  IU
Indiana U. HEP     Fred.Luehring@xxxxxxx +41 22 767 11 66 CERN