We're running HTCondor 24.0.5 with one dedicated submit node (Access Point).
When the number of submitted jobs gets large, say over 20,000, we notice that commands like condor_q become really slow, and sometimes even time out or fail.
If we put some idle jobs on hold, things get much more responsive again.
That helps temporarily, but we'd prefer not to intervene manually if possible.
I also tried increasing the value of SCHEDD_QUERY_WORKERS, but it didn't seem to make much difference.
So I'm wondering if anyone has tuning tips or best practices for improving schedd performance when handling a large number of jobs.
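For context, this is the kind of change meant here: a sketch of the knob as it would appear in the Access Point's configuration. The value shown is only illustrative and is an assumption, not necessarily the exact number we set.

```
# Illustrative value only (assumption); raises the number of
# helper processes the schedd may use to answer queries.
SCHEDD_QUERY_WORKERS = 16
```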
Are there specific configuration tweaks or limits we should look into?
Thanks a lot for any suggestions!