[Condor-users] CCB errors leading to job evictions.
- Date: Tue, 14 Jun 2011 14:53:22 -0700
- From: Colin Leavett-Brown <crlb@xxxxxxx>
- Subject: [Condor-users] CCB errors leading to job evictions.
We are running Condor 7.6.1 in a Xen virtual machine (both the physical
host and the VM run Scientific Linux SL release 5.5 (Boron)), and we are
seeing between 6% and 10% of our jobs evicted and restarted multiple
times, apparently because of CCB failures. Jobs also often hit CCB errors
at startup, which delays them. The following ShadowLog extract is from a
job that experienced both kinds of problem:
Problems when starting:
06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message
from CCB server collector 206.12.154.58:9618 in response to request for
reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting
request for ccbid 9024 because no daemon is currently registered with
that id (perhaps it recently disconnected).
06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to
startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:34 (3331.0) (25184): locateStarter(): Failed to
connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure
message from CCB server collector 206.12.154.58:9618 in response to
request for reversed connection to startd vm192.cloud.nrc.ca: CCB server
rejecting request for ccbid 9024 because no daemon is currently
registered with that id (perhaps it recently disconnected).
06/04/11 06:35:42 (3331.0) (25184): Failed to reverse connect to
startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:42 (3331.0) (25184): locateStarter(): Failed to
connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:59 (3331.0) (25184): CCBClient: received failure
message from CCB server collector 206.12.154.58:9618 in response to
request for reversed connection to startd vm192.cloud.nrc.ca: CCB server
rejecting request for ccbid 9024 because no daemon is currently
registered with that id (perhaps it recently disconnected).
06/04/11 06:35:59 (3331.0) (25184): Failed to reverse connect to
startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:59 (3331.0) (25184): locateStarter(): Failed to
connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:36:51 (3331.0) (25184): Failed to reverse connect to
startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:36:51 (3331.0) (25184): locateStarter(): Failed to
connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
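For anyone triaging similar logs, the startd address in the locateStarter() lines can be split into the startd endpoint, the CCB server endpoint, and the CCB ID. The following is a minimal sketch, assuming the address always has the shape `<ip:port?CCBID=ip:port#id>` seen in the lines above; `parse_ccb_address` and `CcbAddress` are hypothetical names for illustration, not part of Condor.

```python
import re
from typing import NamedTuple, Optional


class CcbAddress(NamedTuple):
    startd_host: str
    startd_port: int
    ccb_host: str
    ccb_port: int
    ccb_id: int


# Assumed shape, taken from the log excerpt:
#   <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
_ADDR_RE = re.compile(
    r"<(?P<host>[\d.]+):(?P<port>\d+)"
    r"\?CCBID=(?P<ccb_host>[\d.]+):(?P<ccb_port>\d+)#(?P<ccb_id>\d+)>"
)


def parse_ccb_address(text: str) -> Optional[CcbAddress]:
    """Return the first CCB-style address found in text, or None."""
    m = _ADDR_RE.search(text)
    if m is None:
        return None
    return CcbAddress(
        m["host"],
        int(m["port"]),
        m["ccb_host"],
        int(m["ccb_port"]),
        int(m["ccb_id"]),
    )
```

Grouping failures by `ccb_id` this way would show whether the rejections are tied to one registration (9024 here) or spread across many.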
Communication error leading to eviction:
06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from
vm192.cloud.nrc.ca
06/04/11 06:45:00 (3331.0) (25184): condor_read(): Socket closed when
trying to read 21 bytes from <132.246.148.92:40009>
06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed
to read response ad.
06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow
(condor_SHADOW) pid 25184 EXITING WITH STATUS 107
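To see which jobs are hit by both symptoms, the ShadowLog can be tallied per job. This is a rough sketch, assuming each log entry is on a single line in the real file (the wrapping above comes from the email) and follows the `MM/DD/YY HH:MM:SS (job) (pid): message` layout shown in the excerpt; `summarize` is a hypothetical helper, and the two message substrings it matches are copied from the lines above.

```python
import re
from collections import Counter, defaultdict

# Assumed line layout, taken from the excerpt:
#   06/04/11 06:35:34 (3331.0) (25184): <message>
_LINE_RE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
    r"\((?P<job>\d+\.\d+)\) \(\d+\): (?P<msg>.*)$"
)


def summarize(lines):
    """Count CCB reverse-connect failures and evictions per job id."""
    summary = defaultdict(Counter)
    for line in lines:
        m = _LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not match the assumed layout
        msg = m["msg"]
        if "Failed to reverse connect" in msg:
            summary[m["job"]]["ccb_failures"] += 1
        elif "is being evicted" in msg:
            summary[m["job"]]["evictions"] += 1
    return summary
```

Calling `summarize(open("ShadowLog"))` (the path is an assumption) returns a mapping from job id to counts, so jobs with both non-zero `ccb_failures` and `evictions` can be picked out.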
Has anyone else experienced these kinds of problems?
--
Colin Leavett-Brown
Department of Physics & Astronomy
University of Victoria
250-721-7728