[Condor-users] production ccb pool not communicating
- Date: Fri, 26 Feb 2010 11:26:18 -0600
- From: Joe Boyd <boyd@xxxxxxxx>
- Subject: [Condor-users] production ccb pool not communicating
Hello,
I've got a production Condor pool that has been running for a while but is now
getting communication errors with multiple remote sites. CCB is used to talk to
the remote sites. The local machine is 131.225.216.64 and the schedd log seems
to be saying that the local schedd can't connect to the local CCB servers
running on ports 9877, 9878, and 9879.
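One quick way to confirm the symptom outside of Condor is to attempt a plain TCP connection to each CCB port from the submit host. The following is only a minimal sketch, assuming Python 3 is available there; `probe` and `probe_all` are illustrative names, not part of Condor, and the 20-second default simply mirrors the timeout reported in the logs.

```python
import socket


def probe(host, port, timeout=20.0):
    """Try a plain TCP connect and return 'open' or a short failure reason."""
    try:
        # create_connection resolves the host and applies the timeout to connect()
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Nothing answered within the timeout (e.g. dropped packets, full backlog)
        return "timed out"
    except OSError as exc:
        # Refused, unreachable, no route to host, and similar errors
        return f"failed: {exc}"


def probe_all(host, ports, timeout=20.0):
    """Probe several ports on one host and return {port: status}."""
    return {port: probe(host, port, timeout) for port in ports}


# Example call (address and ports taken from the message above):
# print(probe_all("131.225.216.64", [9877, 9878, 9879]))
```

A "timed out" result for a port that another host can reach would point at something local (firewall rule, exhausted listen backlog or file descriptors on the collector), whereas "failed: Connection refused" would suggest nothing is listening on that port at all.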
This pool was in use and I'm not aware of any config changes being made. I have
restarted the pool and removed all the remote glideins, since jobs weren't running
anyway. Neither step helped.
The remote machines do seem to be reporting back to the Collector properly via
CCB; it's the local daemons that don't seem to be able to communicate.
Any help appreciated.
joe
Here is a snippet of the ShadowLog:
02/26 09:32:19 (63446.0) (21089): attempt to connect to <131.225.216.64:9878>
failed: timed out after 20 seconds.
02/26 09:32:19 (63446.0) (21089): Failed to reverse connect to startd
glidein_19053@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:19 (63446.0) (21089): glidein_19053@xxxxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the startd
02/26 09:32:19 (63446.0) (21089): Job 63446.0 is being evicted from
glidein_19053@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:25 (63401.0) (21462): attempt to connect to <131.225.216.64:9879>
failed: timed out after 20 seconds.
02/26 09:32:25 (63401.0) (21462): Failed to reverse connect to startd
glidein_1605@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:25 (63401.0) (21462): glidein_1605@xxxxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the startd
02/26 09:32:25 (63401.0) (21462): Job 63401.0 is being evicted from
glidein_1605@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:33 (64305.0) (20721): attempt to connect to <131.225.216.64:9878>
failed: timed out after 20 seconds.
02/26 09:32:33 (64305.0) (20721): Failed to reverse connect to
<145.100.48.37:53084> via CCB.
02/26 09:32:33 (64305.0) (20721): RemoteResource::killStarter(): Could not send
command to startd
02/26 09:32:33 (64305.0) (20721): logEvictEvent with unknown reason (108), aborting
02/26 09:32:33 (64305.0) (20721): **** condor_shadow (condor_SHADOW) pid 20721
EXITING WITH STATUS 108
02/26 09:32:34 (63430.0) (19812): attempt to connect to <131.225.216.64:9878>
failed: timed out after 20 seconds.
02/26 09:32:34 (63430.0) (19812): Failed to reverse connect to
<134.158.73.56:52363> via CCB.
02/26 09:32:34 (63430.0) (19812): RemoteResource::killStarter(): Could not send
command to startd
02/26 09:32:34 (63430.0) (19812): logEvictEvent with unknown reason (108), aborting
02/26 09:32:34 (63430.0) (19812): **** condor_shadow (condor_SHADOW) pid 19812
EXITING WITH STATUS 108
02/26 09:32:35 (63418.0) (20805): attempt to connect to <131.225.216.64:9878>
failed: timed out after 20 seconds.
02/26 09:32:35 (63418.0) (20805): Failed to reverse connect to
<134.158.73.11:36610> via CCB.
And here is a snippet from the SchedLog:
02/26 09:58:03 (pid:26356) CCBClient: no more CCB servers to try for requesting
reversed connection to startd at <155.198.216.230:58302>; giving up.
02/26 09:58:03 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<155.198.216.230:58302>: SECMAN:2003:TCP connection to startd at
<155.198.216.230:58302> failed.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9878> failed:
timed out after 20 seconds.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9877> failed:
timed out after 20 seconds.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector
131.225.216.64:9878: SECMAN:2003:TCP connection to collector 131.225.216.64:9878
failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for requesting
reversed connection to startd at <134.158.73.87:51338>; giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<134.158.73.87:51338>: SECMAN:2003:TCP connection to startd at
<134.158.73.87:51338> failed.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector
131.225.216.64:9877: SECMAN:2003:TCP connection to collector 131.225.216.64:9877
failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for requesting
reversed connection to startd at <155.198.216.131:41092>; giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at
<155.198.216.131:41092>: SECMAN:2003:TCP connection to startd at
<155.198.216.131:41092> failed.
02/26 09:58:23 (pid:26356) attempt to connect to <155.198.217.34:60842> failed:
No route to host (connect errno = 113).
02/26 09:58:23 (pid:26356) Failed to send REQUEST_CLAIM to startd
glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for
samgrid@xxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd
glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for
samgrid@xxxxxxxxxxxxxxxxxx failed.
02/26 09:58:23 (pid:26356) Match record (glidein_13694@xxxxxxxxxxxxxxxxxxxxx
<155.198.217.34:60842> for samgrid@xxxxxxxxxxxxxxxxxx, 62202.0) deleted
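To see at a glance which endpoints are timing out in logs like the two snippets above, the "attempt to connect ... timed out" entries can be tallied per address and port. This is a rough sketch only: the regular expression is an assumption based on the line format shown in this message (including the line wrapping introduced by the mail client), and `count_timeouts` is an illustrative helper, not a Condor tool.

```python
import re
from collections import Counter

# Matches e.g. "attempt to connect to <131.225.216.64:9878> failed: timed out".
# \s+ tolerates the line breaks that mail wrapping inserted into the pasted logs.
TIMEOUT_RE = re.compile(
    r"attempt to connect to <([\d.]+):(\d+)>\s+failed:\s+timed out"
)


def count_timeouts(log_text):
    """Count connect timeouts per (host, port) endpoint in a block of log text."""
    return Counter(
        (host, int(port)) for host, port in TIMEOUT_RE.findall(log_text)
    )


# A few lines copied from the SchedLog snippet above.
SAMPLE = """\
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9878> failed:
timed out after 20 seconds.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9877> failed:
timed out after 20 seconds.
02/26 09:58:23 (pid:26356) attempt to connect to <155.198.217.34:60842> failed:
No route to host (connect errno = 113).
"""

for (host, port), n in sorted(count_timeouts(SAMPLE).items()):
    print(f"{host}:{port} timed out {n} time(s)")
```

Only timeouts are counted, so the "No route to host" failure against the remote startd is deliberately left out; that separates the failures against the local CCB ports from the failures against remote glideins.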
One of the Collector logs:
02/26 11:22:17 MasterAd : Inserting ** "<
glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:17 stats: Inserting new hashent for
'Master':'glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.98.36'
02/26 11:22:21 StartdAd : Inserting ** "<
monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for
'Start':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:21 StartdPvtAd : Inserting ** "<
monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for
'StartdPvt':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:22 MasterAd : Inserting ** "< glidein_5940@xxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:22 stats: Inserting new hashent for
'Master':'glidein_5940@xxxxxxxxxxxxxxxxxxxx':'134.158.73.25'
02/26 11:22:23 MasterAd : Inserting ** "<
glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:23 stats: Inserting new hashent for
'Master':'glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.133'
02/26 11:22:23 condor_write(): Socket closed when trying to write 285 bytes to
<194.171.99.140:41788>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!
MyType = "(unknown type)"
TargetType = "(unknown type)"
AuthMethods = "GSI"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "REQUIRED"
Authentication = "REQUIRED"
Encryption = "OPTIONAL"
Integrity = "REQUIRED"
Enact = "NO"
Subsystem = "MASTER"
ServerPid = 12469
SessionDuration = "60"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.3.1 May 19 2009 BuildID: 154007 $"
ServerCommandSock = "<194.171.99.140:57692>"
Command = 67
02/26 11:22:23 condor_write(): Socket closed when trying to write 288 bytes to
<194.171.99.139:45429>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!