
Re: [Condor-users] Shadow errors after installing 7.5.3



MAX_ACCEPTS_PER_CYCLE is a daemon core event loop variable that gives
priority to accepting incoming socket connections in Condor's event
loop, so it affects all daemon-core based daemons.  Out of the gate, the
schedd can spawn more processes than it can handle in its current
architecture.  Some of this can be partially mitigated with existing
variables, but that does not address the crux of the matter.  The primary
culprit is that the event loop is rather naive about how and where it
spends its time, so when the schedd spawns a series of shadows it spends
a copious amount of time handling internal timers but not enough time
talking with its children, who "give up" trying to communicate with
their parent.  As a result, during large job bursts we have seen Condor,
under certain conditions, enter a "death spiral" of spawning and reaping
shadows.  This variable gives priority to the initial socket accept,
which prevents shadow cycling and helps with communication issues in
general.
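
To make the mechanism concrete, here is a toy model of an event loop
that caps the number of accepts per cycle.  It is only a sketch and not
Condor's daemon core code: the function simulate() and its parameters
(arrivals_per_cycle, patience) are hypothetical names invented for
illustration, and real timers and handlers are reduced to a comment.

```python
from collections import deque


def simulate(max_accepts_per_cycle, arrivals_per_cycle, cycles, patience):
    """Toy event loop (hypothetical, not Condor's implementation).

    Each cycle, `arrivals_per_cycle` children connect to the parent.
    The parent accepts at most `max_accepts_per_cycle` pending
    connections, then spends the rest of the cycle on other work.
    A child that has waited more than `patience` cycles gives up.
    Returns (accepted, gave_up).
    """
    pending = deque()  # cycle number at which each waiting child connected
    accepted = 0
    gave_up = 0
    for now in range(cycles):
        # New children try to connect to their parent.
        for _ in range(arrivals_per_cycle):
            pending.append(now)
        # Children that have waited too long give up.
        while pending and now - pending[0] > patience:
            pending.popleft()
            gave_up += 1
        # The parent accepts up to the per-cycle cap, oldest first.
        for _ in range(min(max_accepts_per_cycle, len(pending))):
            pending.popleft()
            accepted += 1
        # Timers and other handlers would run here before the next cycle.
    return accepted, gave_up


if __name__ == "__main__":
    for cap in (1, 4):
        accepted, gave_up = simulate(cap, arrivals_per_cycle=3,
                                     cycles=20, patience=2)
        print(f"cap={cap}: accepted={accepted}, gave_up={gave_up}")
```

In this toy run, a cap of 1 cannot keep up with three new connections
per cycle, so children start giving up; a cap of 4 accepts every child
before its patience runs out.  The numbers are illustrative only and
say nothing about the right value for a real pool.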

We've noticed a significant decrease in the number of shadows that are
spawned and killed when this variable is tuned; the sweet spot is
between 3 and 5, depending on your pool.  It also decreases the overall
number of communication errors and job restarts, which directly affects
pool throughput.

E.g., in an oversubscribed pool of 3000 nodes we saw communication
errors drop from several thousand to fewer than 50.

Cheers,
Tim  

On Thu, 2010-07-01 at 17:43 -0400, Peter Doherty wrote:
> Could you explain these variables a bit?  I can't find   
> MAX_ACCEPTS_PER_CYCLE in the condor manual, and a google search only  
> turns up this:
> 
> https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1348
> 
> I gather it has something to do with a shadow persisting after a job  
> exits, and continuing as a shadow for the next job on the resource,  
> vs. just having the shadow exit and respawn for the next job.
> 
> Peter
> 
> 
> On Jul 1, 2010, at 13:30 , Timothy St. Clair wrote:
> 
> > Peter & Dan -
> >
> > 	If you're aiming for performance gains, that may have a counter
> > effect.
> > You will also want to set MAX_ACCEPTS_PER_CYCLE=4.
> >
> > Cheers,
> > Tim
> >
> > On Thu, 2010-07-01 at 11:22 -0400, Peter Doherty wrote:
> >> Thanks Dan, that did the trick.
> >>
> >> As always, lightning quick diags and solutions.
> >>
> >> Best,
> >> Peter
> >>
> >>
> >> On Jul 1, 2010, at 04:28 , Dan Bradley wrote:
> >>
> >>> Peter,
> >>>
> >>> Please try the following configuration setting:
> >>>
> >>> SHADOW_WORKLIFE = 0
> >>>
> >>> Based on your report, I reproduced the problem in 7.5.3 and
> >>> confirmed that the above setting avoids the problem.
> >>>
> >>> --Dan
> >>>
> >>> Peter Doherty wrote:
> >>>> Hi,
> >>>>
> >>>> So I thought I'd try out 7.5.3.
> >>>> I'm getting errors in my ShadowLog that I didn't have before, and I
> >>>> don't really know exactly what they mean.  It's showing auth
> >>>> failures, but I'm not clear what daemon is trying to authenticate
> >>>> with which other daemon.  I'm using GSI security.  And I've still
> >>>> got jobs running, so I'm not sure what's really a fatal error here.
> >>>> I guess I'll turn on more debugging and see if that makes any
> >>>> sense, but I'd love any tips if you've got them.
> >>>> The log entries are below:
> >>>>
> >>>> thanks,
> >>>> peter
> >>>>
> >>>>
> >>>> (I changed some IPs/hostnames/DNs in these log events.)
> >>>>
> >>>> 06/30/10 21:59:37 (2284464.0) (30749):
> >>>> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security
> >>>> session for <10.200.4.14:56387>#1277942747#3#..., so will fall back
> >>>> on security negotiation
> >>>> 06/30/10 21:59:37 (2284464.0) (30749):
> >>>> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security
> >>>> session for filetrans.<10.200.4.14:56387>#1277942747#3#..., so will
> >>>> fall back on security negotiation
> >>>>
> >>>> 06/30/10 21:59:43 (2284464.0) (30749): DC_AUTHENTICATE: required
> >>>> authentication of 200.136.80.9 failed: AUTHENTICATE:1003:Failed to
> >>>> authenticate with any method|AUTHENTICATE:1004:Failed to
> >>>> authenticate using GSI|GSI:5005:Failed to authenticate with
> >>>> client.  Client does not trust our certificate.  You may want to
> >>>> check the GSI_DAEMON_NAME in the condor_config|GSI:5004:Failed to
> >>>> gss_assist_gridmap /DC=org/DC=doegrids/OU=Services/CN=hostname to a
> >>>> local user.  Check the grid-mapfile.
> >>>>
> >>>> 06/30/10 21:59:43 (2284464.0) (30749): ERROR "Error from
> >>>> computer@hostname: Could not initiate file transfer" at line 658 in
> >>>> file pseudo_ops.cpp
> >>>> _______________________________________________
> >>>> Condor-users mailing list
> >>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> >>>> with a
> >>>> subject: Unsubscribe
> >>>> You can also unsubscribe by visiting
> >>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>>>
> >>>> The archives can be found at:
> >>>> https://lists.cs.wisc.edu/archive/condor-users/
> >>>>