Colin,
The situation you describe is caused by a job getting matched to a
startd that is no longer connected to the CCB server. If you can track
down the startd logs, it would be helpful to determine why the startd
was no longer connected. Was the startd dead? If not, did the startd
notice that it was disconnected from the CCB server? If not, perhaps
some network device silently dropped the connection. In some cases,
that can be avoided by configuring a shorter CCB_HEARTBEAT_INTERVAL,
which forces more frequent activity on the connection.
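
To work out which startd logs to track down, a rough sketch along these lines could tally the rejected reverse-connect requests per startd and CCB id. It is only an illustration: the pattern is taken from the message text quoted later in this thread, it assumes each entry sits on a single line in the real ShadowLog (the lines are wrapped in the email), and the function name count_ccb_rejections is made up for this example.

```python
import re
from collections import Counter

# Pattern based on the rejection message quoted in this thread.
# Assumption: each log entry occupies one line in the actual ShadowLog.
REJECT_RE = re.compile(
    r"request for reversed connection to startd (?P<startd>\S+): "
    r"CCB server rejecting request for ccbid (?P<ccbid>\d+)"
)


def count_ccb_rejections(lines):
    """Return a Counter keyed by (startd, ccbid) for rejected requests."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group("startd"), match.group("ccbid"))] += 1
    return counts


# Small demonstration using one entry from the extract in this thread.
sample = (
    "06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure "
    "message from CCB server collector 206.12.154.58:9618 in response to "
    "request for reversed connection to startd vm192.cloud.nrc.ca: CCB "
    "server rejecting request for ccbid 9024 because no daemon is "
    "currently registered with that id (perhaps it recently disconnected)."
)
for (startd, ccbid), n in count_ccb_rejections([sample]).most_common():
    print(f"{n:6d}  {startd}  ccbid={ccbid}")
```

For a real log, pass an open file object instead of the sample list. The startds with the highest counts are the ones whose logs are most worth reading first.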
Another possible explanation is that the CCB server (i.e. the collector)
is running low on resources and therefore is failing to stay connected
to all of the daemons. I have seen this happen when using iptables with
too small a value for ip_conntrack_max on the CCB server machine.
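
One way to check for this on the CCB server machine is to compare the number of tracked connections with the limit. The sketch below is a guess at a convenient check, not an official tool: the /proc paths are assumptions (the ip_conntrack names for older kernels, the nf_conntrack names for newer ones), and they only exist when the connection-tracking module is loaded. When the table fills up, the kernel log typically also reports something like "table full, dropping packet".

```python
from pathlib import Path

# Assumed locations of the counters. The first pair uses the older
# ip_conntrack naming, the second the nf_conntrack naming of newer
# kernels. Neither exists unless connection tracking is loaded.
CANDIDATES = [
    ("/proc/sys/net/ipv4/netfilter/ip_conntrack_count",
     "/proc/sys/net/ipv4/netfilter/ip_conntrack_max"),
    ("/proc/sys/net/netfilter/nf_conntrack_count",
     "/proc/sys/net/netfilter/nf_conntrack_max"),
]


def conntrack_usage(candidates=CANDIDATES):
    """Return (count, maximum) from the first readable pair, else None."""
    for count_path, max_path in candidates:
        try:
            count = int(Path(count_path).read_text().strip())
            maximum = int(Path(max_path).read_text().strip())
        except (OSError, ValueError):
            continue  # pair missing or unreadable; try the next one
        return count, maximum
    return None


usage = conntrack_usage()
if usage is None:
    print("no conntrack counters found (module not loaded?)")
else:
    count, maximum = usage
    print(f"tracked connections: {count} of {maximum}")
    if maximum > 0 and count >= 0.9 * maximum:
        print("warning: table is at 90% of its limit or more")
```

The 90% threshold is arbitrary. If the count sits close to the limit while the pool is busy, raising the limit is the obvious thing to try.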
Hope that helps.
--Dan
On 6/14/11 4:53 PM, Colin Leavett-Brown wrote:
We are running Condor 7.6.1 in a Xen virtual machine (both the real
host and the VM have Scientific Linux SL release 5.5 (Boron) installed),
and we are seeing between 6% and 10% of our jobs being evicted and
restarted multiple times, apparently because of CCB failures. Jobs
also often hit CCB errors when starting, which delays them. The
following ShadowLog extract is from a job that experienced both kinds
of issue:
Problems when starting:
06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:34 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:42 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:42 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:59 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:59 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:59 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:36:51 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:36:51 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
Communication error leading to eviction:
06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from vm192.cloud.nrc.ca
06/04/11 06:45:00 (3331.0) (25184): condor_read(): Socket closed when trying to read 21 bytes from <132.246.148.92:40009>
06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed to read response ad.
06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow (condor_SHADOW) pid 25184 EXITING WITH STATUS 107
Has anyone else experienced this kind of problem?