Re: [HTCondor-devel] Fwd: [Osg-gfactory-support] About IPv6 tests in ITB pool


Date: Tue, 28 Mar 2017 13:04:57 -0500
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Fwd: [Osg-gfactory-support] About IPv6 tests in ITB pool
Actually, hold on there ...

No one is able to confirm (yet) that they actually upgraded the condor_startd version to one that supports IPv6 as was suggested (grumble grumble).  Let me get precise versions of all involved components (CCB, schedd, startd) to avoid setting you off on a goose chase (domestic or otherwise).

Brian

On Mar 28, 2017, at 12:55 PM, Zach Miller <zmiller@xxxxxxxxxxx> wrote:

Huh.  Although I am familiar with the security side of things, I have to admit I have no experience with IPv6.  I will need to investigate, probably with Todd Miller's help.  Thanks for the report and I will get back to you.


Cheers,
-zach


-----Original Message-----
From: HTCondor-devel [mailto:htcondor-devel-bounces@xxxxxxxxxxx] On Behalf
Of Brian Bockelman
Sent: Tuesday, March 28, 2017 12:44 PM
To: Condor Developers <htcondor-devel@xxxxxxxxxxx>
Subject: [HTCondor-devel] Fwd: [Osg-gfactory-support] About IPv6 tests in
ITB pool

Hi HTCondor folk,

The claim from the CMS pilot operators is that the following does not match
IPv6 addresses:

ALLOW_DAEMON=*

(They've had to explicitly list each worker node's IP address to move
forward in testing...)

Can someone confirm / deny that fact?

Additionally, can someone look at the CCB log [2] below?  Seems the
connection reversing of the startd back to schedd is attempting to go over
v4, despite this being a V6-only host.  MyAddress as sent by the CCB
contains both V4 and V6; V4 appears to be selected.  Thoughts?
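
For reference, the MyAddress value in log [2] can be split into its
per-protocol candidates with a few lines of Python. This is only a sketch:
the function name parse_addrs is made up, and the assumption that IPv6
colons are written as dashes inside addrs= is inferred from the log lines
themselves, not taken from the HTCondor source.

```python
import ipaddress


def parse_addrs(sinful):
    """Return (family, host, port) tuples from the addrs= part of a sinful string."""
    body = sinful.strip().strip("<>")
    _, _, query = body.partition("?")
    results = []
    for param in query.split("&"):
        key, _, value = param.partition("=")
        if key != "addrs":
            continue
        # Candidates are separated by "+"; host and port by the last "-".
        for entry in value.split("+"):
            host, _, port = entry.rpartition("-")
            if host.startswith("[") and host.endswith("]"):
                # Assumption: IPv6 colons appear as dashes inside addrs=.
                host = host[1:-1].replace("-", ":")
            family = ipaddress.ip_address(host).version  # 4 or 6
            results.append((family, host, int(port)))
    return results


my_address = (
    "<188.184.94.50:4080?addrs=188.184.94.50-4080+"
    "[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>"
)
for family, host, port in parse_addrs(my_address):
    print(f"IPv{family} candidate: {host} port {port}")
```

Run against the schedd's MyAddress, this lists the IPv4 candidate first and
the IPv6 candidate second, the same order in which they appear in the
string.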

Thanks,

Brian



Begin forwarded message:

From: Diego Davila Foyo <diego.davila@xxxxxxx>

Subject: RE: [Osg-gfactory-support] About IPv6 tests in ITB pool

Date: March 28, 2017 at 7:30:24 AM CDT

To: Edgar M Fajardo Hernandez <emfajardohernandez@xxxxxxxxxxxxxxxx>

Cc: Jeffrey Michael Dost <jdost@xxxxxxxx>,
"bbockelm@xxxxxxxxxxx" <bbockelm@xxxxxxxxxxx>,
Marian Zvada <Marian.Zvada@xxxxxxx>,
"Farrukh Aftab Khan" <farrukh.aftab.khan@xxxxxxx>,
"emfajard@xxxxxxxx" <emfajard@xxxxxxxx>,
"osg-gfactory-support@xxxxxxxxxxxxxxxx" <osg-gfactory-support@xxxxxxxxxxxxxxxx>,
Todor Trendafilov Ivanov <todor.trendafilov.ivanov@xxxxxxx>,
Andrea Sciaba <Andrea.Sciaba@xxxxxxx>,
Duncan Rand <duncan.rand@xxxxxxxxxxxxxx>,
Marco Mascheroni <marco.mascheroni@xxxxxxx>,
Raul Cardoso Lopes <raul.cardoso.lopes@xxxxxxx>


Thank you Edgar, you were right about ALLOW_WRITE, but setting
ALLOW_DAEMON = $(ALLOW_DAEMON),pilot04@cms didn't work. I had to add
Brunel's IPv6 address explicitly to ALLOW_DAEMON to get it to work.

After setting ALLOW_DAEMON = $(ALLOW_DAEMON),2001:630:10:f001::19a0
in both the CCB and the Collector, I started to see glideins connecting back
to the collector. I had to apply a special setting for HA to prevent
the negotiator from going to the Fermi Central Manager whenever I set
PREFER_IPV4=False.


I have set up one schedd for IPv6 and sent some test jobs. The
negotiation process goes well now, but I see a problem with the claim
process. In the logs I can see the following:
Schedd [1]
CCB [2]
Startd [3]

What I find strange is that the Startd is trying to connect to the
Schedd (188.184.94.50) using ipv4. We couldn't find any reference to the
ipv6 address of the schedd within the logs. Any thoughts?
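
To make the expected behaviour concrete, here is a minimal sketch of
picking a candidate by address family. It is not HTCondor's selection
logic; pick_address, usable_families and prefer_ipv4 are hypothetical
names, and the policy (try the preferred family first, skip any family the
host cannot route) is an assumption made for illustration.

```python
import ipaddress


def pick_address(candidates, usable_families, prefer_ipv4=True):
    """Return the first (host, port) whose family the local host can use.

    candidates: list of (host, port) pairs, e.g. from an addrs= list.
    usable_families: set containing 4 and/or 6 (hypothetical input).
    """
    order = (4, 6) if prefer_ipv4 else (6, 4)
    for family in order:
        if family not in usable_families:
            continue  # the host has no route for this family
        for host, port in candidates:
            if ipaddress.ip_address(host).version == family:
                return host, port
    return None  # nothing reachable


schedd = [("188.184.94.50", 4080), ("2001:1458:201:e4::100:62c", 4080)]
# On a v6-only worker node only family 6 is usable.
print(pick_address(schedd, usable_families={6}))
```

Under this policy a v6-only host would never be handed 188.184.94.50, even
with the IPv4 preference left at its default.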

Regards,

Diego




[1]
03/28/17 11:33:34 Timed out requesting claim glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila after REQUEST_CLAIM_TIMEOUT=240 seconds.
03/28/17 11:33:34 Match record (glidein_3722464_389738448@wn-a3-18-00.brunel.ac.uk <127.0.0.1:21711>#1490692382#1#... for ddavila, 171.2) deleted
03/28/17 11:33:34 Canceling request for claim glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila 171.2
03/28/17 11:33:34 SECMAN: resuming command 442 REQUEST_CLAIM to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila from TCP port -1 (non-blocking).
03/28/17 11:33:34 SECMAN: TCP connection to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila failed.
03/28/17 11:33:34 Failed to send REQUEST_CLAIM to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila: SECMAN:2003:TCP connection to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila failed.|CEDAR:6007:operation was canceled
03/28/17 11:33:34 CLOSE TCP <[2001:1458:201:e4::100:62c]:16101> fd=17

[2]
03/28/17 11:33:35 CCB: received request id 19416 from SCHEDD <188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179> on <[2001:1458:201:e4::100:62c]:40045> for target ccbid 17198 (registered as STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>)
03/28/17 11:33:35 Address rewriting: refused for attribute MyAddress (MyAddress = "<188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>"): the address isn't my default address. (Default: <188.185.81.179:9644?addrs=[2001-1458-d00-2--100-1ad]-9644+188.185.81.179-9644>, found in ad: <188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>)
03/28/17 11:33:35 encrypting secret
03/28/17 11:33:35 condor_write(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=408,timeout=1,flags=0,non_blocking=0)
03/28/17 11:34:31 condor_read(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=21,timeout=1,flags=0,non_blocking=1)
03/28/17 11:34:31 condor_read(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=263,timeout=1,flags=0,non_blocking=1)
03/28/17 11:34:31 encrypting secret
03/28/17 11:34:31 CCB: received error from target daemon STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346> with ccbid 17198 for request 19415 from (client which has gone away): failed to connect
03/28/17 11:34:31 CCB: client for request 19415 to target daemon STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346> with ccbid 17198 disappeared before receiving error details.
03/28/17 11:35:02 CollectorAd  : Updating ... "< Personal Condor at vocms0803.cern.ch@xxxxxxxxxxxxxxxxx >"
03/28/17 11:35:02 Trying to update collector <[2001:1458:201:e4::100:535]:9618>
03/28/17 11:35:02 Attempting to send update via UDP to collector vocms0807.cern.ch <[2001:1458:201:e4::100:535]:9618>
03/28/17 11:35:02 Guess address string for host = <[2001:1458:201:e4::100:535]:9618>, port = 0
03/28/17 11:35:02 it was sinful string. ip = 2001:1458:201:e4::100:535, port = 9618
03/28/17 11:35:02 _condorOutMsg MTU changed from default to 60000
03/28/17 11:35:02 SECMAN: command 19 UPDATE_COLLECTOR_AD to collector vocms0807.cern.ch:9618 from UDP port 32109 (blocking, raw).
03/28/17 11:35:02 SECMAN: no cached key for {<[2001:1458:201:e4::100:535]:9618>,<19>}.
03/28/17 11:35:02 SECMAN: Security Policy:


[3]
03/28/17 09:44:49 (pid:3625703) attempt to connect to <131.225.205.29:9668> failed: Network is unreachable (connect errno = 101).
03/28/17 09:44:49 (pid:3625703) ERROR: SECMAN:2003:TCP connection to collector cmssrv215.fnal.gov:9668 failed.
03/28/17 09:44:49 (pid:3625703) Failed to start non-blocking update to <131.225.205.29:9668>.
03/28/17 09:48:26 (pid:3625703) attempt to connect to <188.184.94.50:4080> failed: Network is unreachable (connect errno = 101). Will keep trying for 300 total seconds (300 to go).
03/28/17 09:49:03 (pid:3625703) attempt to connect to <188.184.94.50:4080> failed: Network is unreachable (connect errno = 101).
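
A quick way to summarise the failures in a startd log like [3] is to count
failed connect targets by address family. The sketch below assumes the
"attempt to connect to <host:port> failed" wording seen in [3]; the
function name and the regular expression are illustrative and not part of
any HTCondor tool.

```python
import ipaddress
import re
from collections import Counter

# Matches "attempt to connect to <host:port> failed", tolerating a line
# wrap between "to" and the address. IPv6 hosts are expected in brackets.
FAILED_CONNECT = re.compile(
    r"attempt to connect to\s+<(\[[^\]]+\]|[^:>\s]+):(\d+)>\s+failed"
)


def failed_targets_by_family(log_text):
    """Count failed connect attempts keyed by (family, host, port)."""
    counts = Counter()
    for host, port in FAILED_CONNECT.findall(log_text):
        host = host.strip("[]")
        family = ipaddress.ip_address(host).version  # 4 or 6
        counts[(family, host, int(port))] += 1
    return counts


sample = (
    "03/28/17 09:44:49 (pid:3625703) attempt to connect to\n"
    "<131.225.205.29:9668> failed: Network is unreachable (connect errno = 101).\n"
    "03/28/17 09:48:26 (pid:3625703) attempt to connect to\n"
    "<188.184.94.50:4080> failed: Network is unreachable (connect errno = 101).\n"
)
for (family, host, port), n in failed_targets_by_family(sample).items():
    print(f"IPv{family} {host}:{port} failed {n} time(s)")
```

If every failed target comes back as IPv4 on a host that only has IPv6
connectivity, that supports the reading that the v4 candidate is being
selected.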
