
[HTCondor-users] protocol error in collector after housekeeping



Can someone point me to what could be causing the following?  Everything is condor 8.4.7 from the wisc.edu repo.  The MasterLog started throwing these messages when I upgraded from 8.2.8 to 8.4.7.  There’s an "Unknown protocol" error in the collector log whose timestamps seem to line up with them.  It always comes right after the housekeeper’s "Done cleaning" message.

 

----- master log

06/27/16 13:42:34 condor_write(): Socket closed when trying to write 1276 bytes to collector 10.1.1.55, fd is 11

06/27/16 13:42:34 Buf::write(): condor_write() failed

06/27/16 13:42:34 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).

06/27/16 13:42:34 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.

06/27/16 13:42:34 Failed to start non-blocking update to <10.1.1.55:9618>.

06/27/16 13:42:45 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46684

06/27/16 13:57:46 DefaultReaper unexpectedly called on pid 46684, status 1024.

06/27/16 13:57:46 The COLLECTOR (pid 46684) exited with status 4

06/27/16 13:57:46 Sending obituary for "/usr/sbin/condor_collector"

06/27/16 13:57:47 restarting /usr/sbin/condor_collector in 10 seconds

06/27/16 13:57:47 condor_write(): Socket closed when trying to write 1277 bytes to collector 10.1.1.55, fd is 11

06/27/16 13:57:47 Buf::write(): condor_write() failed

06/27/16 13:57:47 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).

06/27/16 13:57:47 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.

06/27/16 13:57:47 Failed to start non-blocking update to <10.1.1.55:9618>.

06/27/16 13:57:57 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46794
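A note on the two status numbers in the master log: as far as I can tell, the 1024 in the DefaultReaper line is the raw wait status and the 4 in the next line is the exit code taken from it. A minimal sketch of that decoding, assuming the usual Linux layout of the status word (the function name is just for illustration, and stopped/continued states are ignored):

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw Linux wait status into exit code and signal number.

    Assumes the conventional layout: low 7 bits hold the terminating
    signal (0 if the process exited on its own), the next byte up holds
    the exit code.
    """
    signal = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return {
        "exited_normally": signal == 0,
        "exit_code": exit_code,
        "signal": signal,
    }


# Value taken from the DefaultReaper line in the master log.
result = decode_wait_status(1024)
print(result)
```

So the collector is exiting by itself with code 4 rather than being killed by a signal, which fits the "aborting" wording in the collector log.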

 

----- collector log

06/27/16 13:42:00 Query info: matched=89; skipped=9; query_time=0.003030; send_time=0.031904; type=Any; requirements={( ( ( MyType == "Scheduler" ) || (

06/27/16 13:42:33 Housekeeper:  Ready to clean old ads

06/27/16 13:42:33       Cleaning StartdAds ...

06/27/16 13:42:33       Cleaning StartdPrivateAds ...

06/27/16 13:42:33       Cleaning ScheddAds ...

06/27/16 13:42:33       Cleaning SubmittorAds ...

06/27/16 13:42:33       Cleaning LicenseAds ...

06/27/16 13:42:33       Cleaning MasterAds ...

06/27/16 13:42:33       Cleaning CkptServerAds ...

06/27/16 13:42:33       Cleaning CollectorAds ...

06/27/16 13:42:33       Cleaning StorageAds ...

06/27/16 13:42:33       Cleaning NegotiatorAds ...

06/27/16 13:42:33       Cleaning HadAds ...

06/27/16 13:42:33       Cleaning GridAds ...

06/27/16 13:42:33       Cleaning XferServiceAds ...

06/27/16 13:42:33       Cleaning LeaseManagerAds ...

06/27/16 13:42:33       Cleaning Generic Ads ...

06/27/16 13:42:33 Housekeeper:  Done cleaning

06/27/16 13:42:34 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.

06/27/16 13:42:45 Setting maximum file descriptors to 10240.

06/27/16 13:42:45 ******************************************************

06/27/16 13:42:45 ** condor_collector (CONDOR_COLLECTOR) STARTING UP

06/27/16 13:42:45 ** /usr/sbin/condor_collector

06/27/16 13:42:45 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)

06/27/16 13:42:45 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON

06/27/16 13:42:45 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $

06/27/16 13:42:45 ** $CondorPlatform: x86_64_RedHat7 $

06/27/16 13:42:45 ** PID = 46684

06/27/16 13:42:45 ** Log last touched 6/27 13:42:34

06/27/16 13:42:45 ******************************************************
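To put numbers on the timing, here is a small sketch that parses the timestamps of a few lines copied from the two logs above. The helper name is made up for illustration, and the ERROR line is cut to the part that is visible in the log:

```python
from datetime import datetime


def stamp(line: str) -> datetime:
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


# Lines copied from the collector log and master log excerpts.
done = "06/27/16 13:42:33 Housekeeper:  Done cleaning"
error = '06/27/16 13:42:34 ERROR "Unknown protocol (1) in Sock::bind(); aborting."'
started = ('06/27/16 13:42:45 Started DaemonCore process '
           '"/usr/sbin/condor_collector", pid and pgroup = 46684')
exited = "06/27/16 13:57:46 The COLLECTOR (pid 46684) exited with status 4"

# Seconds between the housekeeper finishing and the collector aborting.
gap_after_housekeeping = (stamp(error) - stamp(done)).total_seconds()

# Seconds the restarted collector stayed up before exiting again.
collector_uptime = (stamp(exited) - stamp(started)).total_seconds()

print(gap_after_housekeeping, collector_uptime)
```

The abort comes one second after the housekeeper finishes, and the restarted collector lasts 901 seconds, so it seems to die roughly every 15 minutes.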

 

 

10.1.1.55 is the condor host, running CentOS 7.  The firewall is active, but that interface is in the trusted zone.  SELinux is off.  The host is a virtual machine on ESXi, if that makes any difference (4 CPUs, 4 GB memory).

Its config is

 

------------

CONDOR_HOST = 10.1.1.55

COLLECTOR_NAME          = AGBU

ALLOW_READ = 10.1.*

ALLOW_WRITE = 10.1.*

DEFAULT_DOMAIN_NAME = agbu.localdomain

NO_DNS = True

TRUST_UID_DOMAIN = True

BIND_ALL_INTERFACES = False

NETWORK_INTERFACE = 10.1.1.55

 

START = True

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

-------------------
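To rule out a typo in the address settings, here is a rough sanity check over the relevant lines of the config above. This is only a sketch: the parser handles plain `KEY = value` lines and is not how HTCondor itself reads its configuration, and I don't know whether the address family has anything to do with the Sock::bind() error.

```python
import ipaddress

# Address-related lines copied from the config above.
CONFIG = """\
CONDOR_HOST = 10.1.1.55
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.1.1.55
"""


def parse_config(text: str) -> dict:
    """Collect simple 'KEY = value' pairs, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        if "=" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


settings = parse_config(CONFIG)

# ip_address() raises ValueError if the value is not a literal IP address.
iface = ipaddress.ip_address(settings["NETWORK_INTERFACE"])
same_host = settings["CONDOR_HOST"] == settings["NETWORK_INTERFACE"]

print(iface.version, same_host)
```

Both settings are the same literal IPv4 address, so at least the values themselves look consistent.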

 

My laptop is on the same network and has no trouble maintaining an ssh connection to 10.1.1.55.  There are no entries in /var/log/messages indicating any issues.

 

Ideas anyone?

 

 

--
Klint Gore
Database Manager
Sheep CRC
A.G.B.U.
University of New England
Armidale NSW 2350

Ph: 02 6773 3789  
Fax: 02 6773 3266
EMail: kgore4@xxxxxxxxxx