
[HTCondor-users] Best Practices for Large HPC clusters for master and submitters



Hi everyone,

I'd like to understand the best practices and tuning parameters for the master and submit nodes. My setup includes 1 master and around 10 submitters, with all executors being dynamic and spawned based on idle jobs. Almost all configuration parameters on the machines are at their defaults.

I'm currently facing a few issues:
1. When a submitter has a very large number of jobs (around 10,000), performance degrades significantly; for example, condor_q becomes very slow or appears to hang, and scheduling also becomes very slow.

2. On the master, I frequently observe network saturation on a single port. I'm currently using shared_port; is there a way to leverage multiple ports to reduce this load?

3. Occasionally, TCP sockets are closed and jobs restart because the job lease duration times out. This seems to happen randomly, even when the system load is not particularly high.

Any guidance or recommendations would be appreciated.

Thanks and regards,
Raman