
Re: [HTCondor-users] How to Scale with HTCondor

Date: Mon, 2 Mar 2026 21:11:57 +0000
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to Scale with HTCondor

You're already using the best scaling option: multiple submitters (i.e. Access Points). If you can't add more APs, then we can turn to making your existing ones more efficient.

Most of the load on an AP happens when a job starts and finishes. If you can alter your workflows so that individual jobs run longer, that will help scalability. Evenly distributing the short-running jobs across all of the APs will also help.
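
One way to picture both suggestions is the minimal Python sketch below. It assumes the short tasks can be run back to back inside a single job, and the names in it (`batch_tasks`, `assign_round_robin`, `ap1`, `ap2`, the task labels) are hypothetical, not part of HTCondor. It only plans the grouping and leaves the actual submission to whatever tooling is already in place.

```python
from itertools import islice


def batch_tasks(tasks, tasks_per_job):
    """Group short tasks so that each job runs several of them in sequence.

    Fewer, longer jobs mean fewer job starts and finishes for the AP to handle.
    """
    it = iter(tasks)
    batches = []
    while True:
        batch = list(islice(it, tasks_per_job))
        if not batch:
            return batches
        batches.append(batch)


def assign_round_robin(batches, access_points):
    """Spread the batched jobs evenly across APs: {ap_name: [batch, ...]}."""
    plan = {ap: [] for ap in access_points}
    for i, batch in enumerate(batches):
        plan[access_points[i % len(access_points)]].append(batch)
    return plan


# Hypothetical example: 10 short tasks, 4 per job, 2 access points.
tasks = [f"task-{i}" for i in range(10)]
batches = batch_tasks(tasks, tasks_per_job=4)
plan = assign_round_robin(batches, ["ap1", "ap2"])
```

How many tasks to put in one job is a judgment call. Larger batches cut start and finish overhead on the AP, but they also mean more work is lost if a job is evicted.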

If you have large clusters of jobs (e.g. one submission of 1000 similar jobs), then using late materialization can reduce each AP's memory usage and bookkeeping overhead for idle jobs.
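
As a rough illustration, late materialization is requested in the submit description with the `max_materialize` and `max_idle` commands. The Python sketch below only renders such a description as text. The helper name, the executable `run_task.sh`, and the limits of 200 and 50 are made-up values for illustration, and the sketch assumes the schedd is configured to allow late materialization.

```python
def late_materialization_submit(executable, total_jobs, max_materialize, max_idle):
    """Render a submit description that asks the schedd to materialize jobs lazily.

    max_materialize caps how many jobs of the cluster exist in the queue at once;
    max_idle caps how many of those may be idle before materialization pauses.
    """
    lines = [
        f"executable = {executable}",
        "arguments = $(ProcId)",
        f"max_materialize = {max_materialize}",
        f"max_idle = {max_idle}",
        f"queue {total_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: 1000 jobs, at most 200 materialized, at most 50 idle.
submit_text = late_materialization_submit("run_task.sh", 1000, 200, 50)
print(submit_text)
```

The printed text can be saved to a file and passed to condor_submit. The limits would need tuning so that enough idle jobs remain to keep the slots busy.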

 - Jaime

> On Feb 25, 2026, at 1:30 PM, Ram Ban <ramban046@xxxxxxxxx> wrote:
> 
> Hi all,
> 
> I have a condor pool for HPC with 1 master and 20 submitters, and all executors have partitionable slots, which are launched based on requirement.
> Normally everything runs fine. However, when more than about 10k slots are running, or more than about 2k executors, RecentDaemonCoreDutyCycle on a random submitter spikes, and scheduling seems to stop on the other submitters as well. I recently increased the file descriptor limits for the daemons, and I increased MAX_ACCEPTS_PER_CYCLE and MAX_TIMER_EVENTS_PER_CYCLE on the master and some submitters, which solved the problem at the time. But as the scale increases, I am guessing at which configuration variables might help.
> 
> I have only 1 condor pool, although I have virtual runTypes: an executor has a specific type, and jobs marked with that type can run only on those machines.
> I have mixed jobs that use CPUs and GPUs, both long-running and short-running (the short-running ones are far more numerous).
> According to the documentation, I have some options:
> 1. Add more submitters 
> 2. Use different port for collector to reduce network load
> 
> Are there any suggestions for proceeding with a large condor pool?
> 
> Thanks and Regards 
> Raman
>
