Hi all,
I have a Condor pool for HPC with 1 master and 20 submitters. All executors use partitionable slots and are launched on demand.
Normally everything runs fine. However, when there are more than about 10k slots running, or more than about 2k executors, RecentDaemonCoreDutyCycle on a random submitter peaks, and scheduling appears to stop on the other submitters as well. I recently increased the file descriptor limit for the daemons, and I also increased MAX_ACCEPTS_PER_CYCLE and MAX_TIMER_EVENTS_PER_CYCLE on the master and some submitters, which solved the problem at the time. As the scale grows, though, I am left guessing which other variables might help.
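In case it helps to reproduce what I am seeing, here is a minimal sketch of how the busy submitters could be spotted. It only parses the text printed by `condor_status -schedd -af Name RecentDaemonCoreDutyCycle`. The function name `busy_schedds` and the 0.95 threshold are my own assumptions for illustration, not anything taken from the documentation.

```python
# Sketch: flag submitters (schedds) whose RecentDaemonCoreDutyCycle is high.
# Input is the plain-text output of:
#   condor_status -schedd -af Name RecentDaemonCoreDutyCycle
# The function name and the default threshold are illustrative assumptions.

def busy_schedds(status_output, threshold=0.95):
    """Return (name, duty_cycle) pairs above the threshold, busiest first."""
    busy = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        try:
            duty = float(value)
        except ValueError:
            continue  # the attribute may print as "undefined"
        if duty > threshold:
            busy.append((name, duty))
    # Sort so the most heavily loaded submitter comes first.
    return sorted(busy, key=lambda item: item[1], reverse=True)
```

Feeding it the command output captured every minute or so should show whether the peak stays on one submitter or moves around.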
I have only 1 Condor pool, but I use virtual runTypes: each executor has a specific type, and jobs marked with that type can run only on those machines.
The jobs are mixed: some use CPUs, some use GPUs, and there are both long-running and short-running jobs (the short ones are far more numerous).
According to the documentation, I have some options:
1. Add more submitters
2. Use a different port for the collector to reduce network load
Are there any suggestions for how to proceed with a large Condor pool?
Thanks and regards,
Raman