
Re: [HTCondor-users] Best Practices for Large HPC clusters for master and submitters



I can't speak to the last two items, but on the first, make sure you
have blazing fast disks supporting the condor database files.
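
For example, assuming "database files" here means the schedd's job queue
log (by default $(SPOOL)/job_queue.log), a minimal sketch of pointing it
at a fast local SSD on each submit node might look like the following.
The /ssd/condor path is hypothetical, not something from your setup:

```
# Local config on each submit node (sketch only; path is an assumption)
# Default when unset is $(SPOOL)/job_queue.log
JOB_QUEUE_LOG = /ssd/condor/job_queue.log
```

You can check the effective value with "condor_config_val JOB_QUEUE_LOG".
The existing file would presumably need to be moved to the new location
while HTCondor is stopped, so that queued jobs are not lost.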

On Tue, Feb 3, 2026 at 1:15 PM Ram Ban <ramban046@xxxxxxxxx> wrote:
>
> Hi everyone,
>
> I'd like to understand the best practices and tuning parameters for the master and submit nodes. My setup includes 1 master and around 10 submitters, with all executors being dynamic and spawned based on idle jobs. Almost all parameters on the machines are at their defaults.
>
> I'm currently facing a few issues:
> 1. When a submitter has a very large number of jobs (around 10,000), performance degrades significantly; for example, condor_q becomes very slow or appears to hang, and scheduling also becomes too slow.
>
> 2. On the master, I frequently observe network saturation on a single port. I'm currently using shared_port; is there a way to leverage multiple ports to reduce this load?
>
> 3. Occasionally, TCP sockets are closed and jobs restart due to lease duration timeouts. This seems to happen randomly, even when the system load is not particularly high.
>
> Any guidance or recommendations would be appreciated.
>
> Thanks and regards
> Raman