RE: [Condor-users] negotiating with schedds when a client has FW
- Date: Wed, 29 Jun 2005 09:25:56 +0100
- From: "Andrey Kaliazin" <A.Kaliazin@xxxxxxxxxxx>
- Subject: RE: [Condor-users] negotiating with schedds when a client has FW
Thanks Erik,
Your detailed explanation does shed light on this mystery.
Unfortunately (or fortunately for the users here), some recent changes to
our network infrastructure have removed a lot of problems, reducing the
number of these faults to practically nil. So it is difficult to reproduce
the error immediately to verify your cure, but I will keep an eye on it
and report the results to this forum if the problem persists.
cheers,
Andrey
PS.
> To defend Nick a bit here, few of us on the Condor Team believed you :)
It is a bit disappointing to hear that. :-(
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> Sent: Friday, June 24, 2005 6:39 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] negotiating with schedds when a client has FW
>
> On Fri, Jun 17, 2005 at 03:59:56PM +0100, Andrey Kaliazin wrote:
> > Hi Thomas,
> >
> > You just got the same problem I was hit by back in February (the
> > subject was "Negotiator gets stuck").
> > Unfortunately I did not get a satisfactory response from the
> > developers. The best proposal (from Chris Mellen) was to use the macro
> >
> > NEGOTIATE_ALL_JOBS_IN_CLUSTER = True
> >
> > in the condor_config file where the SCHEDD is running.
> > This is a very useful macro indeed, but not in this particular case.
> >
> > It seems I have failed to persuade Nick LeRoy that this problem has
> > nothing to do with the Negotiator <-> Schedd talks, but with the
> > Negotiator <-> Startd part of the negotiation process.
>
> To defend Nick a bit here, few of us on the Condor Team believed you :)
>
> It's certainly not designed to happen that way, and the code says it
> can't, but we understand how it happens.
>
> > The Schedd is fine here: it provides the string of jobs to run and
> > just waits patiently while the Negotiator dispatches them. If the
> > Start daemons respond properly, everything is fine. But if one of the
> > compute nodes at the top of the matched list fails for various
> > reasons (mainly networking problems in our case), the Negotiator does
> > not just dismiss it and take the next best node, but halts the whole
> > cycle.
>
> Well, it doesn't halt the whole cycle, but it drops the schedd for that
> cycle. (And if you've only got one schedd, that effectively ends the
> whole cycle.)
>
> The problem is a confluence of timeouts. The message to the startd,
> telling it that it's been matched, is sent as a UDP packet and isn't
> supposed to block (it's not integral to the matchmaking protocol that
> the startd receive this message from the negotiator). However, if it's
> the first time the negotiator has sent a UDP packet to that startd, it
> first establishes a TCP connection to the startd to create a security
> session - and that can block. With the firewall there, it can be 10
> seconds before the TCP connect fails and we get back to the negotiator
> with an error - which means we drop that startd from the list of things
> we're considering for this cycle and go on to the next best machine to
> make the match, like we've always done and like everyone expects us
> to do.
>
> HOWEVER - back on the ranch at the schedd, no one's heard from the
> negotiator in a while (because it's been busy trying to connect to
> blocked startds). It turns out that we ship by default a config file
> that says "never wait more than 20 seconds for the negotiator to tell
> you something", so after 20 seconds of not hearing from the negotiator,
> the schedd closes the connection. The negotiator, meanwhile, is making
> another match for the schedd, and once it finds one it goes to tell the
> schedd - and discovers that the socket is closed, so it prints out
> "Error: Ignoring schedd for this cycle".
>
> The workaround is to increase your NEGOTIATOR_TIMEOUT setting on
> submit machines. Just to be safe, give it 45 or 60 seconds. Don't
> mess with it on the central manager.
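>
> A minimal sketch of that change, in the local condor_config on each
> submit machine (60 seconds being the upper end of the range above):
>
>    ## Let the schedd wait longer on the negotiator, so that slow TCP
>    ## connects to firewalled startds don't cut the cycle short.
>    NEGOTIATOR_TIMEOUT = 60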
>
> -Erik
>
> > And a couple of minutes later, in the next cycle, the story repeats
> > itself, because this faulty node is still at the top of the list.
> >
> > regards,
> >
> > Andrey
> >
> >
> > > -----Original Message-----
> > > From: condor-users-bounces@xxxxxxxxxxx
> > > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Thomas Lisson
> > > Sent: Friday, June 17, 2005 3:11 PM
> > > To: Condor-Users Mail List
> > > Subject: [Condor-users] negotiating with schedds when a client has FW
> > >
> > > Hello,
> > >
> > > $CondorVersion: 6.7.6 Mar 15 2005 $
> > > $CondorPlatform: I386-LINUX_RH9 $
> > >
> > > I just wondered why my machines weren't claimed even though they
> > > were unclaimed and met all the requirements.
> > >
> > > IA64/LINUX       24    24     0      0       0       0
> > > INTEL/LINUX      60     6     1     53       0       0
> > > INTEL/WINNT50     2     0     0      2       0       0
> > > INTEL/WINNT51   163     0     2    161       0       0
> > > x86_64/LINUX      1     1     0      0       0       0
> > >
> > > Total           250    31     3    216       0       0
> > >
> > >
> > > 6907.002: Run analysis summary. Of 250 machines,
> > >      25 are rejected by your job's requirements
> > >       6 reject your job because of their own requirements
> > >       3 match but are serving users with a better priority in the pool
> > >     216 match but reject the job for unknown reasons
> > >       0 match but will not currently preempt their existing job
> > >       0 are available to run your job
> > >
> > > [...]
> > >
> > > 6907.019: Run analysis summary. Of 250 machines,
> > >      25 are rejected by your job's requirements
> > >       6 reject your job because of their own requirements
> > >       3 match but are serving users with a better priority in the pool
> > >     216 match but reject the job for unknown reasons
> > >       0 match but will not currently preempt their existing job
> > >       0 are available to run your job
> > >
> > > I took a look at the Negotiator log:
> > > 6/16 13:54:33 ---------- Started Negotiation Cycle ----------
> > > 6/16 13:54:33 Phase 1: Obtaining ads from collector ...
> > > 6/16 13:54:33 Getting all public ads ...
> > > 6/16 13:54:33 Sorting 366 ads ...
> > > 6/16 13:54:33 Getting startd private ads ...
> > > 6/16 13:54:33 Got ads: 366 public and 250 private
> > > 6/16 13:54:33 Public ads include 1 submitter, 250 startd
> > > 6/16 13:54:33 Phase 2: Performing accounting ...
> > > 6/16 13:54:33 Phase 3: Sorting submitter ads by priority ...
> > > 6/16 13:54:33 Phase 4.1: Negotiating with schedds ...
> > > 6/16 13:54:33 Negotiating with nobody@*** at <***.130.4.77:9601>
> > > 6/16 13:54:33 Request 06907.00000:
> > > 6/16 13:54:33 Matched 6907.0 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > > 6/16 13:54:33 Successfully matched with vm1@pc49.***
> > > 6/16 13:54:33 Request 06907.00001:
> > > 6/16 13:54:33 Matched 6907.1 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > > 6/16 13:54:33 Successfully matched with vm2@pc49.***
> > > 6/16 13:54:33 Request 06907.00002:
> > > 6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
> > > 6/16 13:57:42 Will keep trying for 10 seconds...
> > > 6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
> > > 6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to <***.130.71.139:10066> failed
> > > 6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
> > > 6/16 13:57:43 Buf::write(): condor_write() failed
> > > 6/16 13:57:43 Could not send PERMISSION
> > > 6/16 13:57:43 Error: Ignoring schedd for this cycle
> > > 6/16 13:57:43 ---------- Finished Negotiation Cycle ----------
> > >
> > > I checked ***.130.71.139 and noticed that the machine had a
> > > dysfunctional network service - all requests were blocked, although
> > > the machine (Win XP) told me the FW was off.
> > > OK, let's assume ***.130.71.139 blocks all incoming traffic, but
> > > why weren't all the other jobs (6907.002-6907.019) serviced in the
> > > same cycle? This job (6907) finished after a while - but other
> > > entries in the NegotiatorLog and MatchLog for that job weren't
> > > complete. Some processes of that cluster were serviced but not
> > > logged - maybe a bug.
> > >
> > > My jobs have rank = kflops in the submit files. The machine
> > > ***.130.71.139 is one of the fastest (4th), so Condor tried to
> > > claim that machine first in every negotiation cycle, because the 3
> > > fastest machines were already claimed. But that machine blocked all
> > > traffic, so Condor stopped matchmaking and didn't look at the next
> > > free machine. So my whole cluster was serviced only by my 3 fastest
> > > machines - out of a pool with 216 other machines that matched and
> > > had nothing to do. That took a long time ;)
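> > >
> > > For reference, a minimal sketch of a submit file with that rank
> > > expression (the executable name is just a placeholder):
> > >
> > >    universe   = vanilla
> > >    executable = my_job
> > >    rank       = KFlops
> > >    queue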
> > >
> > > Suggestion: if Condor can't connect to a machine, it should claim
> > > the next best free machine for the job instead of ending the cycle.
> > > Otherwise, network problems can have big negative effects on the
> > > whole Condor pool.
> > >
> > > regards
> > > Thomas Lisson
> > > NRW-Grid
> > >
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>