Actually, hold on there ...
No one is able to confirm (yet) that they actually upgraded the condor_startd version to one that supports IPv6 as was suggested (grumble grumble). Let me get precise versions of all involved components (CCB, schedd, startd) to avoid setting you off on a goose chase (domestic or otherwise).
Brian
Huh. Although I am familiar with the security side of things, I have to admit I have no experience with IPv6. I will need to investigate, probably with Todd Miller's help. Thanks for the report and I will get back to you.
Cheers,
-zach
-----Original Message-----
From: HTCondor-devel [mailto:htcondor-devel-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
Sent: Tuesday, March 28, 2017 12:44 PM
To: Condor Developers <htcondor-devel@xxxxxxxxxxx>
Subject: [HTCondor-devel] Fwd: [Osg-gfactory-support] About IPv6 tests in ITB pool
Hi HTCondor folk,
The claim from the CMS pilot operators is that the following does not match IPv6 addresses:
ALLOW_DAEMON=*
(They've had to explicitly list each worker node's IP address to move forward in testing...)
Can someone confirm / deny that fact?
Additionally, can someone look at the CCB log [2] below? Seems the connection reversing of the startd back to schedd is attempting to go over v4, despite this being a V6-only host. MyAddress as sent by the CCB contains both V4 and V6; V4 appears to be selected. Thoughts?
Thanks,
Brian
Begin forwarded message:
From: Diego Davila Foyo <diego.davila@xxxxxxx>
Subject: RE: [Osg-gfactory-support] About IPv6 tests in ITB pool
Date: March 28, 2017 at 7:30:24 AM CDT
To: Edgar M Fajardo Hernandez <emfajardohernandez@xxxxxxxxxxxxxxxx>
Cc: Jeffrey Michael Dost <jdost@xxxxxxxx>, "bbockelm@xxxxxxxxxxx" <bbockelm@xxxxxxxxxxx>, Marian Zvada <Marian.Zvada@xxxxxxx>, "Farrukh Aftab Khan" <farrukh.aftab.khan@xxxxxxx>, "emfajard@xxxxxxxx" <emfajard@xxxxxxxx>, "osg-gfactory-support@xxxxxxxxxxxxxxxx" <osg-gfactory-support@xxxxxxxxxxxxxxxx>, Todor Trendafilov Ivanov <todor.trendafilov.ivanov@xxxxxxx>, Andrea Sciaba <Andrea.Sciaba@xxxxxxx>, Duncan Rand <duncan.rand@xxxxxxxxxxxxxx>, Marco Mascheroni <marco.mascheroni@xxxxxxx>, Raul Cardoso Lopes <raul.cardoso.lopes@xxxxxxx>
Thank you Edgar, you were right about ALLOW_WRITE, but setting ALLOW_DAEMON = $(ALLOW_DAEMON),pilot04@cms didn't work. I had to add Brunel's IPv6 address explicitly to ALLOW_DAEMON to get it to work.
After setting ALLOW_DAEMON = $(ALLOW_DAEMON),2001:630:10:f001::19a0 in both the CCB and the Collector, I started to see glideins connecting back to the collector. I had to add a special setting for HA to prevent the negotiator from going to the Fermi Central Manager whenever I set PREFER_IPV4=False.
I have set up one schedd for IPv6 and sent some test jobs. The negotiation process now goes well, but I see a problem with the claim process. In the logs I can see the following: Schedd [1], CCB [2], Startd [3].
What I find strange is that the Startd is trying to connect to the Schedd (188.184.94.50) using ipv4. We couldn't find any reference to the ipv6 address of the schedd within the logs. Any thoughts?
Regards,
Diego
[1]
03/28/17 11:33:34 Timed out requesting claim glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila after REQUEST_CLAIM_TIMEOUT=240 seconds.
03/28/17 11:33:34 Match record (glidein_3722464_389738448@wn-a3-18-00.brunel.ac.uk <127.0.0.1:21711>#1490692382#1#... for ddavila, 171.2) deleted
03/28/17 11:33:34 Canceling request for claim glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila 171.2
03/28/17 11:33:34 SECMAN: resuming command 442 REQUEST_CLAIM to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila from TCP port -1 (non-blocking).
03/28/17 11:33:34 SECMAN: TCP connection to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila failed.
03/28/17 11:33:34 Failed to send REQUEST_CLAIM to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila: SECMAN:2003:TCP connection to startd glidein_3722464_389738448@xxxxxxxxxxxxxxxxxxxxxxxx <127.0.0.1:21711>#1490692382#1#... for ddavila failed.|CEDAR:6007:operation was canceled
03/28/17 11:33:34 CLOSE TCP <[2001:1458:201:e4::100:62c]:16101> fd=17
[2]
03/28/17 11:33:35 CCB: received request id 19416 from SCHEDD <188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179> on <[2001:1458:201:e4::100:62c]:40045> for target ccbid 17198 (registered as STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>)
03/28/17 11:33:35 Address rewriting: refused for attribute MyAddress (MyAddress = "<188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>"): the address isn't my default address. (Default: <188.185.81.179:9644?addrs=[2001-1458-d00-2--100-1ad]-9644+188.185.81.179-9644>, found in ad: <188.184.94.50:4080?addrs=188.184.94.50-4080+[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>)
03/28/17 11:33:35 encrypting secret
03/28/17 11:33:35 condor_write(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=408,timeout=1,flags=0,non_blocking=0)
03/28/17 11:34:31 condor_read(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=21,timeout=1,flags=0,non_blocking=1)
03/28/17 11:34:31 condor_read(fd=22 STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346>,,size=263,timeout=1,flags=0,non_blocking=1)
03/28/17 11:34:31 encrypting secret
03/28/17 11:34:31 CCB: received error from target daemon STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346> with ccbid 17198 for request 19415 from (client which has gone away): failed to connect
03/28/17 11:34:31 CCB: client for request 19415 to target daemon STARTD <127.0.0.1:21711?addrs=[2001-630-10-f001--19a0]-21711+127.0.0.1-21711&noUDP> on <[2001:630:10:f001::19a0]:8346> with ccbid 17198 disappeared before receiving error details.
03/28/17 11:35:02 CollectorAd : Updating ... "< Personal Condor at vocms0803.cern.ch@xxxxxxxxxxxxxxxxx >"
03/28/17 11:35:02 Trying to update collector <[2001:1458:201:e4::100:535]:9618>
03/28/17 11:35:02 Attempting to send update via UDP to collector vocms0807.cern.ch <[2001:1458:201:e4::100:535]:9618>
03/28/17 11:35:02 Guess address string for host = <[2001:1458:201:e4::100:535]:9618>, port = 0
03/28/17 11:35:02 it was sinful string. ip = 2001:1458:201:e4::100:535, port = 9618
03/28/17 11:35:02 _condorOutMsg MTU changed from default to 60000
03/28/17 11:35:02 SECMAN: command 19 UPDATE_COLLECTOR_AD to collector vocms0807.cern.ch:9618 from UDP port 32109 (blocking, raw).
03/28/17 11:35:02 SECMAN: no cached key for {<[2001:1458:201:e4::100:535]:9618>,<19>}.
03/28/17 11:35:02 SECMAN: Security Policy:
[3]
03/28/17 09:44:49 (pid:3625703) attempt to connect to <131.225.205.29:9668> failed: Network is unreachable (connect errno = 101).
03/28/17 09:44:49 (pid:3625703) ERROR: SECMAN:2003:TCP connection to collector cmssrv215.fnal.gov:9668 failed.
03/28/17 09:44:49 (pid:3625703) Failed to start non-blocking update to <131.225.205.29:9668>.
03/28/17 09:48:26 (pid:3625703) attempt to connect to <188.184.94.50:4080> failed: Network is unreachable (connect errno = 101). Will keep trying for 300 total seconds (300 to go).
03/28/17 09:49:03 (pid:3625703) attempt to connect to <188.184.94.50:4080> failed: Network is unreachable (connect errno = 101).
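
Note on the address strings in the logs: the schedd's MyAddress in [2] carries an addrs= parameter that lists both an IPv4 and an IPv6 endpoint, while the primary address in front of the "?" is IPv4, which is the one the startd tries in [3]. The Python sketch below pulls the endpoints out of such a string so the two families can be compared side by side. It is not HTCondor code; parse_addrs is a hypothetical helper. The encoding it assumes (entries joined by "+", port after the last "-", and ":" inside the IPv6 brackets written as "-") is inferred only from the log lines above, not from HTCondor documentation.

```python
import ipaddress


def parse_addrs(sinful):
    """Return (ip, port) pairs from the addrs= parameter of an address string.

    Hypothetical helper: the encoding is inferred from the log excerpts,
    not taken from HTCondor itself.
    """
    body = sinful.strip().lstrip("<").rstrip(">")
    _, _, query = body.partition("?")
    pairs = []
    for param in query.split("&"):
        key, _, value = param.partition("=")
        if key != "addrs":
            continue
        for item in value.split("+"):
            # The port follows the last "-" of each entry.
            host, _, port = item.rpartition("-")
            if host.startswith("["):
                # Assumed: IPv6 colons appear as "-" inside the brackets.
                host = host[1:-1].replace("-", ":")
            pairs.append((ipaddress.ip_address(host), int(port)))
    return pairs


if __name__ == "__main__":
    schedd = ("<188.184.94.50:4080?addrs=188.184.94.50-4080+"
              "[2001-1458-201-e4--100-62c]-4080&noUDP&sock=23745_4d36_179>")
    for ip, port in parse_addrs(schedd):
        print("IPv%d" % ip.version, ip, port)
```

Run against the schedd string from [2], this lists one IPv4 and one IPv6 endpoint on port 4080, which matches the observation that the IPv6 address is advertised but the IPv4 one is what the startd attempts.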