Re: [HTCondor-users] how does the AP resource needs scale with queue size
- Date: Mon, 14 Jul 2025 10:23:13 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] how does the AP resource needs scale with queue size
On 7/9/25 12:56 PM, Matthew West via HTCondor-users wrote:
I am curious if the developers have any updates to the general
description given in
https://research.cs.wisc.edu/htcondor/wiki-archive/pages/HowToManageLargeCondorPools/
about how an AP's CPU & memory requirements scale with the size of the
prospective job queue.
https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#condor-schedd-configuration-file-entries
With modern servers able to have hundreds of GBs of system memory, is
it possible to get queues of jobs (pending >> running) into the 250k
range or higher? Or does the speed of storage or network
communication become the bottleneck before you get that large?
Hi Matt:
While that wiki page is getting kind of old, the basic architecture
information hasn't changed. I know of several sites with APs running
more than 10,000 concurrent jobs, but none at 100,000 or more. Our
scalability story has always been that admins can scale out horizontally
by adding more APs to their system.
My feeling is that even when you can provision an access point with a
very large memory or CPU count, admins get (rightfully) nervous about
having so many eggs in one basket. Any kernel reboot, machine glitch, or
??? can interrupt a lot of work.
-greg
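
To make the scale-out suggestion above concrete, here is a minimal Python sketch of the arithmetic. It does not call any HTCondor API. The function name plan_access_points and the per-AP cap of 60,000 jobs are hypothetical, chosen only for illustration; the 250,000 figure is the queue size from the question. The right cap for a real AP depends on the site and has to be measured.

```python
def plan_access_points(total_jobs: int, per_ap_cap: int) -> list[int]:
    """Split total_jobs as evenly as possible across the fewest APs
    such that no AP holds more than per_ap_cap jobs.

    per_ap_cap is an admin-chosen assumption, not an HTCondor limit.
    """
    if total_jobs < 0:
        raise ValueError("total_jobs must be non-negative")
    if per_ap_cap <= 0:
        raise ValueError("per_ap_cap must be positive")
    if total_jobs == 0:
        return []

    # Integer ceiling division: the fewest APs that respect the cap.
    n_aps = (total_jobs + per_ap_cap - 1) // per_ap_cap

    # Spread the jobs evenly; the first `extra` APs take one more job.
    base, extra = divmod(total_jobs, n_aps)
    return [base + 1 if i < extra else base for i in range(n_aps)]


if __name__ == "__main__":
    # 250,000 queued jobs with a hypothetical cap of 60,000 per AP.
    shares = plan_access_points(250_000, 60_000)
    print(len(shares), "APs:", shares)
```

Spreading the queue evenly, rather than filling each AP to its cap, also limits how much work a single reboot or machine failure can interrupt.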