
Re: [HTCondor-users] HTCondor



Hi Vikrant,

there are a few possible reasons for this behaviour, I suppose. Did you check in the negotiator log how long your negotiation cycles take?
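A quick way to see the cycle times without digging through the NegotiatorLog is to query the negotiator ad; if I remember correctly, the LastNegotiationCycleDuration* attributes hold the duration (in seconds) of the last three cycles:

  condor_status -negotiator -af Name LastNegotiationCycleDuration0 \
      LastNegotiationCycleDuration1 LastNegotiationCycleDuration2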

If you have one slow schedd, the negotiator might spend a lot of time talking to that schedd and forget about other, more relevant schedds with idle jobs - you could tune this with:

NEGOTIATOR_MAX_TIME_PER_SCHEDD (schedd level)

NEGOTIATOR_MAX_TIME_PER_PIESPIN (submitter level)
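For example, a minimal sketch for the negotiator's config (the values are made up and need tuning for your pool; both are in seconds), followed by a condor_reconfig on the central manager:

  NEGOTIATOR_MAX_TIME_PER_SCHEDD = 120
  NEGOTIATOR_MAX_TIME_PER_PIESPIN = 60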

Also check on the schedds whether RecentDaemonCoreDutyCycle is approaching 1.0 - that is a sure sign that there is a load problem on that particular schedd.
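You can pull that attribute for all schedds in one go, e.g.:

  condor_status -schedd -af Name RecentDaemonCoreDutyCycle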

At the same time, make sure you do not have anything running with D_FULLDEBUG in its debug levels - that's a performance killer at any time ;)
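To spot it quickly, dump the effective config on the machine in question and grep for it:

  condor_config_val -dump | grep -i D_FULLDEBUG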

Also, if you happen to have a lot of cores (probably over 100k), things like the slot-weight calculation become a burden for negotiation, and likewise for job starts. If that is the case, you might want to consider putting up a 2nd negotiator, ideally for a smaller physical entity of the pool, e.g. the GPU machines: tag these slots and jobs and move them to the 2nd negotiator - it's easy! A rough sketch follows below.
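Roughly like this (from memory, so treat it as a sketch - the IsGPUSlot/IsGPUJob attribute names are made up, and please double-check the knobs against the "multiple negotiators" section of the manual):

  # on the GPU startds: tag the slots
  IsGPUSlot = True
  STARTD_ATTRS = $(STARTD_ATTRS) IsGPUSlot

  # on the central manager: run a 2nd negotiator for the tagged slots/jobs
  NEGOTIATOR_GPU = $(NEGOTIATOR)
  NEGOTIATOR_GPU_ARGS = -local-name NEGOTIATOR_GPU
  NEGOTIATOR_GPU.NEGOTIATOR_NAME = NEGOTIATOR_GPU
  NEGOTIATOR_GPU.NEGOTIATOR_SLOT_CONSTRAINT = IsGPUSlot =?= True
  NEGOTIATOR_GPU.NEGOTIATOR_JOB_CONSTRAINT = IsGPUJob =?= True
  DAEMON_LIST = $(DAEMON_LIST) NEGOTIATOR_GPU

  # and keep the main negotiator away from the GPU part
  NEGOTIATOR_SLOT_CONSTRAINT = IsGPUSlot =!= True
  NEGOTIATOR_JOB_CONSTRAINT = IsGPUJob =!= True

Jobs then get the tag in the submit file via '+IsGPUJob = True'.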

In any case, it always helps to put the job_queue.log file on the schedds on a fast SSD!
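The transaction log can be moved independently of the rest of the spool via JOB_QUEUE_LOG (the path below is just an example):

  # schedd config - hypothetical SSD mount point
  JOB_QUEUE_LOG = /ssd/condor/spool/job_queue.log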

That is all that tumbles out of the top of my head right now ;)

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone: +49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Donnerstag, 13. Juni 2024 23:12:30
Betreff: [HTCondor-users]  HTcondor

Hello Experts,
We have one cluster whose utilization randomly drops to 50-60%: despite having idle jobs in the queue, it doesn't accept any more jobs.

The command we use to see the % utilization of the clusters is below. We use dynamic slots with 9.0.17 everywhere.

while true ; do condor_status -const start -compact -af totalcpus cpus | awk '{totalcpus+=$1;freecpus+=$2;} END {printf "%f\n",((totalcpus-freecpus)/totalcpus)*100}' ; sleep 60 ; done

Negotiator logs show that the pie limit has been reached (I couldn't find a way to print this limit); AFAIU, it helps to avoid a single user dominating the whole cluster. It takes a good amount of time to do the matchmaking of the jobs in the queue, and we want to speed up job matchmaking in this pie-limit scenario.

How long does it take to re-calculate the pie limit for the users? Can we tune this parameter?

I tried to tweak the following parameter, and it helps to some extent:

PRIORITY_HALFLIFE = 300 

We don't want to ignore user priorities completely for matchmaking, so we keep:

NEGOTIATOR_IGNORE_USER_PRIORITIES = False

During the time of the issue, I do see users appearing in this output, which I believe is because of the pie limit?

condor_status -negotiator -json -attr LastNegotiationCycleSubmittersShareLimit0,LastNegotiationCycleSubmittersShareLimit1,LastNegotiationCycleSubmittersShareLimit2

Also, sometimes we have seen the message that the submitter limit is reached, but I don't see any parameter at the negotiator level indicating a schedd/submitter limit. What parameter is controlling it? I know about MAX_JOBS_RUNNING on the submitter, but that limit was way beyond what we were running.


Thanks & Regards,
Vikrant Aggarwal
