We are running Condor 7.6.1 in a Xen virtual machine (both the physical
host and the VM run Scientific Linux SL release 5.5 (Boron)), and
between 6% and 10% of our jobs are being evicted and restarted multiple
times, apparently because of CCB failures. Jobs also often hit CCB
errors at startup, which delays them. The following ShadowLog extract
is from a single job that experienced both kinds of problem:

Problems when starting:

06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:34 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:42 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:42 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:59 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:59 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:59 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:36:51 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:36:51 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
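
In case it helps anyone check their own ShadowLog for the same pattern, here is a rough Python sketch that counts the "no daemon is currently registered" rejections per job, startd and ccbid. The function name `count_ccb_rejections` and the regular expression are my own illustration, derived only from the message text quoted above. It also assumes each entry sits on a single line in the real log file (the mail client wrapped the lines here).

```python
import re
from collections import Counter

# Pattern built from the rejection message quoted in the extract; it is an
# assumption that other Condor versions word the message the same way.
REJECT = re.compile(
    r"^(?P<ts>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \((?P<job>\d+\.\d+)\) \(\d+\): "
    r"CCBClient: received failure message from CCB server collector "
    r"(?P<ccb>\S+) .*? startd (?P<startd>[^\s:]+): .*?ccbid (?P<ccbid>\d+)"
)


def count_ccb_rejections(lines):
    """Count CCB rejections, keyed by (job id, startd name, ccbid)."""
    counts = Counter()
    for line in lines:
        m = REJECT.match(line)
        if m:
            counts[(m["job"], m["startd"], m["ccbid"])] += 1
    return counts


# Two rejections and one follow-up line, copied from the extract above.
SAMPLE = [
    "06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message "
    "from CCB server collector 206.12.154.58:9618 in response to request for "
    "reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting "
    "request for ccbid 9024 because no daemon is currently registered with "
    "that id (perhaps it recently disconnected).",
    "06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to startd "
    "vm192.cloud.nrc.ca via CCB.",
    "06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure message "
    "from CCB server collector 206.12.154.58:9618 in response to request for "
    "reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting "
    "request for ccbid 9024 because no daemon is currently registered with "
    "that id (perhaps it recently disconnected).",
]

if __name__ == "__main__":
    for key, n in count_ccb_rejections(SAMPLE).items():
        print(key, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_ccb_rejections(open("ShadowLog"))`; the path depends on the local configuration.
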

Communication error leading to eviction:

06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from vm192.cloud.nrc.ca
06/04/11 06:45:00 (3331.0) (25184): condor_read(): Socket closed when trying to read 21 bytes from <132.246.148.92:40009>
06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed to read response ad.
06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow (condor_SHADOW) pid 25184 EXITING WITH STATUS 107

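
To put a number on how many jobs are affected, something like the following sketch could tally evictions per job and shadow exit statuses from a ShadowLog. The name `summarize_evictions` and both patterns are hypothetical, based only on the two message formats shown in this extract. Again, it assumes one log entry per line.

```python
import re
from collections import Counter

# Both patterns are taken from the messages quoted above; treat them as
# assumptions about the log format rather than a documented interface.
EVICT = re.compile(
    r"\((?P<job>\d+\.\d+)\) \(\d+\): Job \S+ is being evicted from (?P<host>\S+)"
)
EXIT = re.compile(
    r"\((?P<job>\d+\.\d+)\) \(\d+\): \*\*\*\* condor_shadow "
    r"\(condor_SHADOW\) pid \d+ EXITING WITH STATUS (?P<status>\d+)"
)


def summarize_evictions(lines):
    """Return (evictions per job id, count of each shadow exit status)."""
    evictions = Counter()
    exit_statuses = Counter()
    for line in lines:
        m = EVICT.search(line)
        if m:
            evictions[m["job"]] += 1
            continue
        m = EXIT.search(line)
        if m:
            exit_statuses[int(m["status"])] += 1
    return evictions, exit_statuses


# Entries copied from the eviction extract above, one per line.
SAMPLE = [
    "06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from "
    "vm192.cloud.nrc.ca",
    "06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed to "
    "read response ad.",
    "06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow (condor_SHADOW) "
    "pid 25184 EXITING WITH STATUS 107",
]

if __name__ == "__main__":
    evictions, exit_statuses = summarize_evictions(SAMPLE)
    print("evictions per job:", dict(evictions))
    print("shadow exit statuses:", dict(exit_statuses))
```

Jobs whose eviction count is greater than one are the ones being restarted repeatedly.
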
Has anyone else experienced these kinds of problems?