
Re: [Condor-users] PASSWD_CACHE_REFRESH in 6.9.4?

Date: Tue, 23 Oct 2007 13:36:04 -0700
From: Stuart Anderson <anderson@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] PASSWD_CACHE_REFRESH in 6.9.4?

Dan,
	Thanks for the explanation. What is the current default and
recommended value for SCHED_UNIV_RENICE_INCREMENT in the 6.9 series?
What about adding a LOCAL_UNIV_RENICE_INCREMENT option? For example,
I would like DAGMan in the scheduler universe to run at a higher
priority than short-running user jobs in the local universe.

Thanks.


On Tue, Oct 23, 2007 at 09:04:29AM -0500, Dan Bradley wrote:
> 
> 
> Ian Chesal wrote:
> >> This seems counterintuitive to me. Why would _not_ nice'ing the
> >> shadow processes on a busy submit machine be a good thing?
> >>
> >
> > Ditto. Is this a Windows-scheduler-only thing? I'm almost certain Alan
> > De Smet's talk every year at Condor Week covers using higher nice
> > levels on the shadows to help out a starved-for-CPU schedd process.
> >
> 
> If you want to increase the priority of the schedd, that is possibly a 
> good idea.  However, using SHADOW_RENICE_INCREMENT=10 to decrease the 
> priority of the shadows below all other normal processes on the system 
> degrades throughput in every case we have observed or tested in the 6.9 
> branch.  Part of the problem is that the schedd and the shadow need to 
> communicate.  During this communication, it is actually possible for the 
> schedd to be slowed down because it is stuck waiting for a response from 
> a low-priority shadow.  More common is to see connection failures in the 
> shadow logs due to the shadow being so CPU-starved that it cannot form a 
> connection to the schedd, even with very generous timeouts.
> 
> Another thing that has changed is that the 6.9.4 schedd is much less 
> CPU-hungry than 6.8.  Tens of thousands of jobs in the queue and a few 
> thousand jobs running should not severely tax the 6.9.4 schedd on 
> reasonable server-class hardware unless the jobs are so short that the 
> completion rate exceeds ~10-15 jobs per second.
> 
> I'll admit that our tests of this have all been under Linux and have 
> been focused on the vanilla universe.  We're certainly hoping for feedback 
> on all the other possible use cases.
> 
> Cheers,
> --Dan
> 
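For readers less familiar with how a renice increment behaves, here is a minimal Python sketch. It assumes that, on Linux, a setting such as SHADOW_RENICE_INCREMENT maps onto the standard nice(2) semantics: the increment is added to the process's nice value, and a higher nice value means a lower scheduling priority. The helper name niceness_after_increment is hypothetical and not part of Condor; it only demonstrates the operating-system mechanism, not Condor's own code.

```python
import os
import subprocess
import sys


def niceness_after_increment(increment: int) -> int:
    """Start a child process with its nice value raised by `increment`
    and return the nice value the child actually runs at (Unix only)."""
    child = subprocess.run(
        # os.nice(0) adds nothing and returns the current nice value.
        [sys.executable, "-c", "import os; print(os.nice(0))"],
        # Runs in the child just before exec, similar in spirit to a
        # daemon renicing a process it spawns (assumption, see above).
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )
    return int(child.stdout)


if __name__ == "__main__":
    print("parent nice value:", os.nice(0))
    print("child with increment 10:", niceness_after_increment(10))
```

On a machine where the parent runs at nice 0, niceness_after_increment(10) reports 10, which is the level the shadows would run at with SHADOW_RENICE_INCREMENT=10 under the assumption above.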

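The completion-rate threshold quoted above can be turned into a rough rule of thumb. The sketch below assumes a steady state in which the completion rate is simply the number of running jobs divided by their mean runtime (Little's law). The function names and the figure of 3000 running jobs (standing in for "a few thousand") are illustrative assumptions, not numbers from the Condor developers.

```python
def completion_rate(running_jobs: int, mean_runtime_s: float) -> float:
    """Steady-state job completions per second (Little's law)."""
    return running_jobs / mean_runtime_s


def min_mean_runtime(running_jobs: int, max_rate_per_s: float) -> float:
    """Shortest mean runtime, in seconds, that keeps the completion
    rate at or below `max_rate_per_s` for a given number of running jobs."""
    return running_jobs / max_rate_per_s


if __name__ == "__main__":
    running = 3000  # assumed value for "a few thousand jobs running"
    for limit in (10, 15):  # the ~10-15 jobs/s range quoted above
        seconds = min_mean_runtime(running, limit)
        print(f"{limit} jobs/s limit -> mean runtime of at least {seconds:.0f} s")
```

Under these assumptions, 3000 running jobs stay below the quoted range as long as the average job runs for more than roughly 200 to 300 seconds.
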
-- 
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson
