Hello,
I've got a production Condor pool that has been running for a while but
is now getting communication errors with multiple remote sites. CCB is
used to talk to the remote sites. The local machine is 131.225.216.64,
and the schedd log seems to be saying that the local schedd can't
connect to the local CCB servers running on ports 9877, 9878, and 9879.
This pool was in use, and I'm not aware of any config changes having
been made. I have restarted the pool and removed all the remote glideins,
since jobs weren't running anyway. Neither step helped.
The remote machines do seem to be properly reporting back to the
Collector via CCB and it's the local daemons that don't seem to be able
to communicate.
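A quick way to double-check the symptom outside of Condor would be a
plain TCP connect from the submit host to each CCB port. Below is a
minimal sketch of that check; the function names can_connect and
probe_ports are made up for illustration and are not part of Condor,
and the 5 second default timeout is an arbitrary choice.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and "No route to host".
        return False


def probe_ports(host: str, ports, timeout: float = 5.0) -> dict:
    """Map each port to True/False depending on whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Run on the local machine, this would be called as
probe_ports("131.225.216.64", [9877, 9878, 9879]). A False result only
shows that the TCP connect itself fails (as the logs already suggest);
it does not say whether the cause is a firewall, a full listen queue,
or a hung collector.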
Any help appreciated.
joe
Here is a snippet of the ShadowLog:
02/26 09:32:19 (63446.0) (21089): attempt to connect to
<131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:19 (63446.0) (21089): Failed to reverse connect to startd
glidein_19053@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:19 (63446.0) (21089): glidein_19053@xxxxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the
startd
02/26 09:32:19 (63446.0) (21089): Job 63446.0 is being evicted from
glidein_19053@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:25 (63401.0) (21462): attempt to connect to
<131.225.216.64:9879> failed: timed out after 20 seconds.
02/26 09:32:25 (63401.0) (21462): Failed to reverse connect to startd
glidein_1605@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:25 (63401.0) (21462): glidein_1605@xxxxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the
startd
02/26 09:32:25 (63401.0) (21462): Job 63401.0 is being evicted from
glidein_1605@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:33 (64305.0) (20721): attempt to connect to
<131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:33 (64305.0) (20721): Failed to reverse connect to
<145.100.48.37:53084> via CCB.
02/26 09:32:33 (64305.0) (20721): RemoteResource::killStarter(): Could
not send command to startd
02/26 09:32:33 (64305.0) (20721): logEvictEvent with unknown reason
(108), aborting
02/26 09:32:33 (64305.0) (20721): **** condor_shadow (condor_SHADOW) pid
20721 EXITING WITH STATUS 108
02/26 09:32:34 (63430.0) (19812): attempt to connect to
<131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:34 (63430.0) (19812): Failed to reverse connect to
<134.158.73.56:52363> via CCB.
02/26 09:32:34 (63430.0) (19812): RemoteResource::killStarter(): Could
not send command to startd
02/26 09:32:34 (63430.0) (19812): logEvictEvent with unknown reason
(108), aborting
02/26 09:32:34 (63430.0) (19812): **** condor_shadow (condor_SHADOW) pid
19812 EXITING WITH STATUS 108
02/26 09:32:35 (63418.0) (20805): attempt to connect to
<131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:35 (63418.0) (20805): Failed to reverse connect to
<134.158.73.11:36610> via CCB.
And a snippet from the SchedLog:
02/26 09:58:03 (pid:26356) CCBClient: no more CCB servers to try for
requesting reversed connection to startd at <155.198.216.230:58302>;
giving up.
02/26 09:58:03 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<155.198.216.230:58302>: SECMAN:2003:TCP connection to startd at
<155.198.216.230:58302> failed.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9878>
failed: timed out after 20 seconds.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9877>
failed: timed out after 20 seconds.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector
131.225.216.64:9878: SECMAN:2003:TCP connection to collector
131.225.216.64:9878 failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for
requesting reversed connection to startd at <134.158.73.87:51338>;
giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<134.158.73.87:51338>: SECMAN:2003:TCP connection to startd at
<134.158.73.87:51338> failed.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector
131.225.216.64:9877: SECMAN:2003:TCP connection to collector
131.225.216.64:9877 failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for
requesting reversed connection to startd at <155.198.216.131:41092>;
giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<155.198.216.131:41092>: SECMAN:2003:TCP connection to startd at
<155.198.216.131:41092> failed.
02/26 09:58:23 (pid:26356) attempt to connect to <155.198.217.34:60842>
failed: No route to host (connect errno = 113).
02/26 09:58:23 (pid:26356) Failed to send REQUEST_CLAIM to startd
glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for
samgrid@xxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd
glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for
samgrid@xxxxxxxxxxxxxxxxxx failed.
02/26 09:58:23 (pid:26356) Match record
(glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for
samgrid@xxxxxxxxxxxxxxxxxx, 62202.0) deleted
One of the Collector logs:
02/26 11:22:17 MasterAd : Inserting ** "<
glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:17 stats: Inserting new hashent for
'Master':'glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.98.36'
02/26 11:22:21 StartdAd : Inserting ** "<
monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for
'Start':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:21 StartdPvtAd : Inserting ** "<
monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for
'StartdPvt':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:22 MasterAd : Inserting ** "<
glidein_5940@xxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:22 stats: Inserting new hashent for
'Master':'glidein_5940@xxxxxxxxxxxxxxxxxxxx':'134.158.73.25'
02/26 11:22:23 MasterAd : Inserting ** "<
glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:23 stats: Inserting new hashent for
'Master':'glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.133'
02/26 11:22:23 condor_write(): Socket closed when trying to write 285
bytes to <194.171.99.140:41788>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!
MyType = "(unknown type)"
TargetType = "(unknown type)"
AuthMethods = "GSI"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "REQUIRED"
Authentication = "REQUIRED"
Encryption = "OPTIONAL"
Integrity = "REQUIRED"
Enact = "NO"
Subsystem = "MASTER"
ServerPid = 12469
SessionDuration = "60"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.3.1 May 19 2009 BuildID: 154007 $"
ServerCommandSock = "<194.171.99.140:57692>"
Command = 67
02/26 11:22:23 condor_write(): Socket closed when trying to write 288
bytes to <194.171.99.139:45429>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!