
Re: [HTCondor-users] Tips for improving schedd performance with many jobs



Hi Christoph,

Thanks so much for the helpful tips! You've given me several suggestions I hadn't even thought about -- I'll definitely try them.

Best,
Jiaqi
From: "Beyer, Christoph"<christoph.beyer@xxxxxxx>
Date: Fri, Oct 17, 2025, 14:26
Subject: Re: [HTCondor-users] Tips for improving schedd performance with many jobs
To: "HTCondor-Users Mail List"<htcondor-users@xxxxxxxxxxx>
Hi Jiaqi,

It depends a little on your setup. One thing you might want to consider is putting the spool directory on a fast SSD, and of course the schedd needs sufficient RAM.

The Condor approach would also be to set up more schedulers, use late materialization and batches of jobs (queue 100 instead of 100 x queue 1), teach people to use condor_watch_q instead of 'watch condor_q', and limit the number of jobs per user in the queue.
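
To make the batching and late materialization points concrete, here is a minimal sketch in Python that writes out a single submit description for a whole batch. The helper name, the script name run_analysis.sh, the logs directory and all the numbers are assumptions for illustration only; max_materialize, max_idle and queue are regular submit commands, but please check the condor_submit manual for your version before relying on them.

```python
def build_submit_description(executable, n_jobs, max_materialize=None,
                             max_idle=None, log_dir="logs"):
    """Return the text of ONE submit description that queues n_jobs jobs.

    Hypothetical helper: one cluster with `queue N` replaces N separate
    submissions of `queue 1`.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")

    lines = [
        f"executable = {executable}",
        # $(Process) runs from 0 to n_jobs - 1 within the cluster.
        "arguments  = $(Process)",
        # One shared event log per cluster, separate stdout/stderr per job.
        f"log        = {log_dir}/cluster_$(ClusterId).log",
        f"output     = {log_dir}/job_$(ClusterId)_$(Process).out",
        f"error      = {log_dir}/job_$(ClusterId)_$(Process).err",
    ]

    # Late materialization: the schedd only turns part of the cluster into
    # real queue entries at any one time.
    if max_materialize is not None:
        lines.append(f"max_materialize = {max_materialize}")
    if max_idle is not None:
        lines.append(f"max_idle = {max_idle}")

    lines.append(f"queue {n_jobs}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative numbers only: 20,000 jobs, at most 500 materialized.
    print(build_submit_description("run_analysis.sh", 20000,
                                   max_materialize=500, max_idle=100))
```

The printed text can be saved to a file and passed to condor_submit. The idea is that the schedd then tracks one cluster and materializes jobs as slots free up, instead of holding all 20,000 job ads in the queue at once.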

With decent hardware, fast SSDs and a Fibre Channel-connected filesystem for job log & output writing, you can run a schedd with up to 100k jobs in different states without trouble, in my experience.

Things immediately become a nuisance in the setup described if someone submits jobs with a typo in the log dir path, etc., but that's just people, they break things ;)

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: <jiaqi.pan@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, 17 October 2025 08:02:54
Subject: [HTCondor-users] Tips for improving schedd performance with many jobs

Hi all,
We're running HTCondor 24.0.5 with one dedicated submit node (Access Point).
When the number of submitted jobs gets large, say over 20,000, we notice that commands like condor_q become really slow, and sometimes even time out or fail.
If we put some idle jobs on hold, things get much more responsive again.
That helps temporarily, but we'd prefer not to intervene manually if possible.
I also tried increasing the value of SCHEDD_QUERY_WORKERS, but it didn't seem to make much difference.
So I'm wondering if anyone has tuning tips or best practices for improving schedd performance when handling a large number of jobs.
Are there specific configuration tweaks or limits we should look into?
Thanks a lot for any suggestions!
Best,
Jiaqi

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/