
Re: [HTCondor-users] how does the AP resource needs scale with queue size




On 7/9/25 12:56 PM, Matthew West via HTCondor-users wrote:
I am curious if the developers have any updates to the general description given in https://research.cs.wisc.edu/htcondor/wiki-archive/pages/HowToManageLargeCondorPools/ about how an AP's CPU & memory requirements scale with the size of the prospective job queue.

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#condor-schedd-configuration-file-entries

With modern servers able to have hundreds of GBs of system memory, is it possible to get queues of jobs (pending >> running) into the 250k range or higher? Or does the speed of storage or network communication become the bottleneck before you get that large?


Hi Matt:

While that wiki page is getting kind of old, the basic architecture information hasn't changed. I know of several sites with APs running more than 10,000 concurrent jobs, but none at 100,000 or more. Our scalability story is always that admins can scale out horizontally and add more APs to their system.
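For reference, the schedd knobs most relevant at that scale are covered by the configuration-macros page linked above. A minimal sketch of AP-side tuning (the values and the spool path here are illustrative assumptions, not recommendations):

```
# Illustrative AP (condor_schedd) tuning sketch -- values are assumptions,
# not recommendations; see the configuration-macros documentation.

# Cap on concurrently running jobs this schedd will manage.
MAX_JOBS_RUNNING = 10000

# Cap on total jobs allowed in this schedd's queue (idle + running).
MAX_JOBS_SUBMITTED = 250000

# Keep the job queue log on fast local storage; the schedd syncs it
# on queue transactions, so storage latency can become the bottleneck
# before memory does. (Path is hypothetical.)
JOB_QUEUE_LOG = /fast-ssd/condor/spool/job_queue.log
```

With a cap per AP, growing the total queue beyond that point is done by standing up additional APs rather than enlarging one.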

My feeling is that even when you can provision a very large memory or CPU-count access point, admins get (rightfully) nervous about having so many eggs in one basket. Any kernel reboot, machine glitch, or other mishap can interrupt a lot of work.


-greg