Hello Experts,
We have one cluster whose utilization randomly drops to 50-60%. Despite idle jobs in the queue, it doesn't accept any more jobs.
We use dynamic slots and version 9.0.17 everywhere. This is the command we use to see the % utilization of a cluster:
while true ; do condor_status -const start -compact -af totalcpus cpus | awk '{totalcpus+=$1;freecpus+=$2;} END {printf "%f\n",((totalcpus-freecpus)/totalcpus)*100}' ; sleep 60 ; done
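One caveat with the loop above: if condor_status returns no slots (for example, while the collector is briefly unreachable), awk aborts with a division by zero. A minimal sketch of a guarded variant follows; the function name utilization_pct is only illustrative and not part of HTCondor, and it assumes the same two-column "totalcpus cpus" input.

```shell
# Compute CPU utilization from "TotalCpus Cpus" pairs read on stdin.
# Prints "n/a" instead of failing when no slots were reported.
utilization_pct() {
    awk '{ total += $1; free += $2 }
         END {
             if (total > 0)
                 printf "%.2f\n", ((total - free) / total) * 100
             else
                 print "n/a"
         }'
}

# Intended use on a live pool (same query as the loop above):
#   condor_status -const start -compact -af totalcpus cpus | utilization_pct
```

The numbers are the same as before; only the empty-input case is handled differently.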
The negotiator logs show that the pie limit has been reached (I couldn't find a way to print this limit). AFAIU, this limit prevents a single user from dominating the whole cluster. Matchmaking for the jobs in the queue then takes a long time, and we want to speed it up in the pie-limit scenario.
How long does it take to recalculate the pie limit for the users? Can we tune this?
I tried tweaking the following parameter, which helps to some extent:
PRIORITY_HALFLIFE = 300
We don't want to ignore user priorities completely for matchmaking, so we keep:
NEGOTIATOR_IGNORE_USER_PRIORITIES = False
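To put the PRIORITY_HALFLIFE value above in perspective, here is a rough sketch of how much weight past usage keeps after a given time. It assumes accumulated usage decays exponentially with the configured half-life, which is my reading of the documentation and not something I have verified in the negotiator. decay_fraction is a hypothetical helper, not an HTCondor tool.

```shell
# Fraction of past usage still counted after a given number of seconds,
# ASSUMING plain exponential decay: 0.5 ^ (elapsed / halflife).
# Usage: decay_fraction HALFLIFE_SECONDS ELAPSED_SECONDS
decay_fraction() {
    awk -v h="$1" -v t="$2" 'BEGIN { printf "%.4f\n", exp(log(0.5) * t / h) }'
}

# With our setting of 300, usage from 10 minutes ago counts for a quarter:
#   decay_fraction 300 600      prints 0.2500
# With a one-day half-life (86400), the same usage has barely decayed:
#   decay_fraction 86400 600
```

Under that assumption, the short half-life makes a user's priority recover within minutes of their jobs finishing, which would fit the partial improvement we see.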
While the issue is occurring, I do see users appearing in the output of the following command, which I believe is because of the pie limit:
condor_status -negotiator -json -attr LastNegotiationCycleSubmittersShareLimit0,LastNegotiationCycleSubmittersShareLimit1,LastNegotiationCycleSubmittersShareLimit2
Also, we have sometimes seen the message that the submitter limit has been reached, but I don't see any negotiator-level parameter that sets a schedd/submitter limit. Which parameter controls it? I know about MAX_JOBS_RUNNING on the submitter, but we were way beyond that limit.
Thanks & Regards,
Vikrant Aggarwal