[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] High rate of random "Failed to securely exchange session key" errors in 8.6.x



Hi Zach,

I can consistently reproduce this on any machine in our production pool which is running 8.6.x. I've tried to setup a single isolated machine to try to reproduce the problem but haven't managed to yet.

I'll send you the other debugging info you requested off-list.

Thanks,
Andrew.

________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Zach Miller [zmiller@xxxxxxxxxxx]
Sent: Tuesday, May 16, 2017 6:25 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] High rate of random "Failed to securely exchange session key" errors in 8.6.x

Hi Andrew,

Thanks for the report.

Are you able to consistently reproduce this?  Does it vary by machine or is it just sporadic?

I do see an anomaly in comparing the two logs, but it isn't immediately obvious what might be wrong and I am currently investigating further.

In the meantime, you can also add D_FULLDEBUG along with D_SECURITY and you will get even more (ahem, a totally insane amount of) debugging info which you can send to me off-list if you'd like.

Also, if possible, can you capture the client side of this conversation? You are using the "-debug" flag, which is good.  Once again you can also get even more debug info by setting the environment variable
  _CONDOR_TOOL_DEBUG=D_ALL:2

...before running the command.

Thanks!


Cheers,
-zach


> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of andrew.lahiff@xxxxxxxxxx
> Sent: Monday, May 15, 2017 5:38 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] High rate of random "Failed to securely
> exchange session key" errors in 8.6.x
>
>
> In case this is helpful, below is an example of what happens in MasterLog
> with "MASTER_DEBUG = D_SECURITY" when "condor_reconfig" is successful [1]
> and when it fails [2].
>
> Regards,
> Andrew.
>
> [1]
> 05/15/17 21:46:36 DC_AUTHENTICATE: received DC_AUTHENTICATE from
> <X.X.X.X:17558>
> 05/15/17 21:46:36 DC_AUTHENTICATE: generating 3DES key for session
> hostname:11227:1494881196:63...
> 05/15/17 21:46:36 SECMAN: new session, doing initial authentication.
> 05/15/17 21:46:36 Returning to DC while we wait for socket to authenticate.
> 05/15/17 21:46:36 AUTHENTICATE: setting timeout for (unknown) to 20.
> 05/15/17 21:46:36 HANDSHAKE: in handshake(my_methods = 'PASSWORD,FS')
> 05/15/17 21:46:36 HANDSHAKE: handshake() - i am the server
> 05/15/17 21:46:36 HANDSHAKE: client sent (methods == 516)
> 05/15/17 21:46:36 HANDSHAKE: i picked (method == 512)
> 05/15/17 21:46:36 HANDSHAKE: client received (method == 512)
> 05/15/17 21:46:36 PW.
> 05/15/17 21:46:36 Will return to DC because authentication is incomplete.
> 05/15/17 21:46:36 PASSWORD: entered authenticate_continue, state==100
> 05/15/17 21:46:36 PW: Server receiving 1.
> 05/15/17 21:46:36 Received: 0, 27(condor_pool@domain), 256
> 05/15/17 21:46:36 PW: Server fetching password.
> 05/15/17 21:46:36 PW: Server generating rb.
> 05/15/17 21:46:36 PW: Server sending.
> 05/15/17 21:46:36 In server_send: 0.
> 05/15/17 21:46:36 Calculating hkt 'condor_pool@domain' (27),
> 'condor_pool@domain' (27).
> 05/15/17 21:46:36 Server send 'condor_pool@domain', 'condor_pool@domain',
> 256 256 20
> 05/15/17 21:46:36 PASSWORD: leaving authenticate_continue, state==101,
> return=2
> 05/15/17 21:46:36 AUTHENTICATE: auth would still block
> 05/15/17 21:46:36 Will return to DC to continue authentication..
> 05/15/17 21:46:36 PASSWORD: entered authenticate_continue, state==101
> 05/15/17 21:46:36 PW: Server receiving 2.
> 05/15/17 21:46:36 PW: Server checking hk.
> 05/15/17 21:46:36 In calculate_hk.
> 05/15/17 21:46:36 Setting session key.
> 05/15/17 21:46:36 Key length: 20
> 05/15/17 21:46:36 PW: Server set session key.
> 05/15/17 21:46:36 PASSWORD: leaving authenticate_continue, state==101,
> return=1
> 05/15/17 21:46:36 Authentication was a Success.
> 05/15/17 21:46:36 ZKM: setting default map to condor_pool@domain
> 05/15/17 21:46:36 ZKM: post-map: current user is 'condor_pool'
> 05/15/17 21:46:36 ZKM: post-map: current domain is 'domain'
> 05/15/17 21:46:36 ZKM: post-map: current FQU is 'condor_pool@domain'
> 05/15/17 21:46:36 AUTHENTICATE: Exchanging keys with remote side.
> 05/15/17 21:46:36 In Condor_Auth_Passwd::wrap.
> 05/15/17 21:46:36 AUTHENTICATE: Result of end of authenticate is 1.
> 05/15/17 21:46:36 DC_AUTHENTICATE: authentication of X.X.X.X complete.
> 05/15/17 21:46:36 DC_AUTHENTICATE: message authenticator enabled with key
> id hostname:11227:1494881196:63.
> 05/15/17 21:46:36 DC_AUTHENTICATE: Success.
>
>
> [2]
> 05/15/17 21:46:37 DC_AUTHENTICATE: received DC_AUTHENTICATE from
> <X.X.X.X:28260>
> 05/15/17 21:46:37 DC_AUTHENTICATE: generating 3DES key for session
> hostname:11227:1494881197:64...
> 05/15/17 21:46:37 SECMAN: new session, doing initial authentication.
> 05/15/17 21:46:37 Returning to DC while we wait for socket to authenticate.
> 05/15/17 21:46:37 AUTHENTICATE: setting timeout for (unknown) to 20.
> 05/15/17 21:46:37 HANDSHAKE: in handshake(my_methods = 'PASSWORD,FS')
> 05/15/17 21:46:37 HANDSHAKE: handshake() - i am the server
> 05/15/17 21:46:37 HANDSHAKE: client sent (methods == 516)
> 05/15/17 21:46:37 HANDSHAKE: i picked (method == 512)
> 05/15/17 21:46:37 HANDSHAKE: client received (method == 512)
> 05/15/17 21:46:37 PW.
> 05/15/17 21:46:37 PW: Server receiving 1.
> 05/15/17 21:46:37 Received: 0, 27(condor_pool@domain), 256
> 05/15/17 21:46:37 PW: Server fetching password.
> 05/15/17 21:46:37 PW: Server generating rb.
> 05/15/17 21:46:37 PW: Server sending.
> 05/15/17 21:46:37 In server_send: 0.
> 05/15/17 21:46:37 Calculating hkt 'condor_pool@domain' (27),
> 'condor_pool@domain' (27).
> 05/15/17 21:46:37 Server send 'condor_pool@domain', 'condor_pool@domain',
> 256 256 20
> 05/15/17 21:46:37 Authentication was a Success.
> 05/15/17 21:46:37 ZKM: setting default map to (null)
> 05/15/17 21:46:37 ZKM: post-map: current user is '(null)'
> 05/15/17 21:46:37 ZKM: post-map: current domain is '(null)'
> 05/15/17 21:46:37 ZKM: post-map: current FQU is '(null)'
> 05/15/17 21:46:37 AUTHENTICATE: Exchanging keys with remote side.
> 05/15/17 21:46:37 In Condor_Auth_Passwd::wrap.
> 05/15/17 21:46:37 AUTHENTICATE: Result of end of authenticate is 0.
> 05/15/17 21:46:37 DC_AUTHENTICATE: required authentication of X.X.X.X
> failed: AUTHENTICATE:1005:Failed to securely exchange session key
>
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of
> andrew.lahiff@xxxxxxxxxx [andrew.lahiff@xxxxxxxxxx]
> Sent: Monday, May 15, 2017 7:49 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] High rate of random "Failed to securely exchange
> session key" errors in 8.6.x
>
> Hi,
>
> I was recently upgrading lots of worker nodes from 8.6.1 to 8.6.2, and
> found that despite having:
>
> MASTER_NEW_BINARY_RESTART=PEACEFUL
>
> jobs were being killed on some (but not all) of the worker nodes. I then
> noticed that "condor_reconfig" randomly fails. Here's an example running it
> 4 times on a worker node:
>
> [root@lcg1377 ~]# condor_reconfig
> Sent "Reconfig" command to local master
> [root@lcg1377 ~]# condor_reconfig
> Sent "Reconfig" command to local master
> [root@lcg1377 ~]# condor_reconfig
> Sent "Reconfig" command to local master
> [root@lcg1377 ~]# condor_reconfig
> ERROR
> AUTHENTICATE:1005:Failed to securely exchange session key
> Can't send Reconfig command to local master
>
> And with the "-debug" option:
>
> [root@lcg1377 ~]# condor_reconfig -debug
> 05/15/17 19:41:12 condor_read() failed: recv(fd=3) returned -1, errno = 104
> Connection reset by peer, reading 5 bytes from <X.X.X.X:9618>.
> 05/15/17 19:41:12 IO: Failed to read packet header
> 05/15/17 19:41:12 SECMAN: required authentication with <X.X.X.X:9618>
> failed, so aborting command DC_RECONFIG_FULL.
> ERROR
> AUTHENTICATE:1005:Failed to securely exchange session key
> Can't send Reconfig command to local master
>
> Today I upgraded the central managers to 8.6.2, and just noticed large
> numbers of "Failed to securely exchange session key" errors in CollectorLog
> and NegotiatorLog. Despite these frequent errors things are generally
> working fine.
>
> We are using FS and PASSWORD authentication.
>
> Has anyone else seen this? With 8.4.x I don't ever remember seeing issues
> like this.
>
> Thanks,
> Andrew.
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/