Can someone point me to what could be causing the following? Everything is Condor 8.4.7 from the wisc.edu repo. The MasterLog started throwing these messages when I upgraded from 8.2.8 to 8.4.7. There’s an "Unknown protocol" error in the collector log whose timestamps seem to line up with the failures in the master log. It always appears right after the "Housekeeper: Done cleaning" message.
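To check that the ordering is more than a coincidence, something like the following rough sketch could be run over the collector log. It only assumes the `MM/DD/YY HH:MM:SS` timestamp prefix visible in the excerpts in this post; the helper names (`parse`, `errors_after_housekeeper`) and the 5-second window are made up for illustration and are not part of Condor.

```python
import re
from datetime import datetime

# Matches the "MM/DD/YY HH:MM:SS" prefix seen on the log lines in this post.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse(line):
    """Return (timestamp, message) for a log line, or None if it has no timestamp."""
    m = STAMP.match(line.strip())
    if not m:
        return None
    return datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S"), m.group(2)


def errors_after_housekeeper(collector_lines, window_seconds=5):
    """Hypothetical helper: collect ERROR lines logged within window_seconds
    after the most recent "Housekeeper: Done cleaning" line."""
    hits = []
    last_done = None
    for line in collector_lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, msg = parsed
        if msg.startswith("Housekeeper: Done cleaning"):
            last_done = stamp
        elif msg.startswith("ERROR") and last_done is not None:
            delay = (stamp - last_done).total_seconds()
            if 0 <= delay <= window_seconds:
                hits.append((stamp, delay, msg))
    return hits


# Lines copied (abridged) from the collector log in this post.
sample = [
    "06/27/16 13:42:33 Cleaning Generic Ads ...",
    "06/27/16 13:42:33 Housekeeper: Done cleaning",
    '06/27/16 13:42:34 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741',
]

for stamp, delay, msg in errors_after_housekeeper(sample):
    print(stamp, delay, msg)
```

To run it against the real log, the `sample` list could be replaced by the lines of the collector log file (on my install that would be under /var/log/condor, but the path is an assumption and depends on the LOG setting).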

----- master log
06/27/16 13:42:34 condor_write(): Socket closed when trying to write 1276 bytes to collector 10.1.1.55, fd is 11
06/27/16 13:42:34 Buf::write(): condor_write() failed
06/27/16 13:42:34 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).
06/27/16 13:42:34 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.
06/27/16 13:42:34 Failed to start non-blocking update to <10.1.1.55:9618>.
06/27/16 13:42:45 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46684
06/27/16 13:57:46 DefaultReaper unexpectedly called on pid 46684, status 1024.
06/27/16 13:57:46 The COLLECTOR (pid 46684) exited with status 4
06/27/16 13:57:46 Sending obituary for "/usr/sbin/condor_collector"
06/27/16 13:57:47 restarting /usr/sbin/condor_collector in 10 seconds
06/27/16 13:57:47 condor_write(): Socket closed when trying to write 1277 bytes to collector 10.1.1.55, fd is 11
06/27/16 13:57:47 Buf::write(): condor_write() failed
06/27/16 13:57:47 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).
06/27/16 13:57:47 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.
06/27/16 13:57:47 Failed to start non-blocking update to <10.1.1.55:9618>.
06/27/16 13:57:57 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46794

----- collector log
06/27/16 13:42:00 Query info: matched=89; skipped=9; query_time=0.003030; send_time=0.031904; type=Any; requirements={( ( ( MyType == "Scheduler" ) || (
06/27/16 13:42:33 Housekeeper: Ready to clean old ads
06/27/16 13:42:33 Cleaning StartdAds ...
06/27/16 13:42:33 Cleaning StartdPrivateAds ...
06/27/16 13:42:33 Cleaning ScheddAds ...
06/27/16 13:42:33 Cleaning SubmittorAds ...
06/27/16 13:42:33 Cleaning LicenseAds ...
06/27/16 13:42:33 Cleaning MasterAds ...
06/27/16 13:42:33 Cleaning CkptServerAds ...
06/27/16 13:42:33 Cleaning CollectorAds ...
06/27/16 13:42:33 Cleaning StorageAds ...
06/27/16 13:42:33 Cleaning NegotiatorAds ...
06/27/16 13:42:33 Cleaning HadAds ...
06/27/16 13:42:33 Cleaning GridAds ...
06/27/16 13:42:33 Cleaning XferServiceAds ...
06/27/16 13:42:33 Cleaning LeaseManagerAds ...
06/27/16 13:42:33 Cleaning Generic Ads ...
06/27/16 13:42:33 Housekeeper: Done cleaning
06/27/16 13:42:34 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.
06/27/16 13:42:45 Setting maximum file descriptors to 10240.
06/27/16 13:42:45 ******************************************************
06/27/16 13:42:45 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
06/27/16 13:42:45 ** /usr/sbin/condor_collector
06/27/16 13:42:45 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
06/27/16 13:42:45 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
06/27/16 13:42:45 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
06/27/16 13:42:45 ** $CondorPlatform: x86_64_RedHat7 $
06/27/16 13:42:45 ** PID = 46684
06/27/16 13:42:45 ** Log last touched 6/27 13:42:34
06/27/16 13:42:45 ******************************************************

10.1.1.55 is the Condor host, running CentOS 7. The firewall is active, but that interface is in the trusted zone. SELinux is off. The host is a VM on ESXi, if that makes any difference (4 CPUs, 4 GB memory). Its config is:

------------
CONDOR_HOST = 10.1.1.55
COLLECTOR_NAME = AGBU
ALLOW_READ = 10.1.*
ALLOW_WRITE = 10.1.*
DEFAULT_DOMAIN_NAME = agbu.localdomain
NO_DNS = True
TRUST_UID_DOMAIN = True
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.1.1.55
START = True
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
-------------------

My laptop is on the same network and has no trouble maintaining an SSH connection to 10.1.1.55. There are no entries in /var/log/messages indicating any issues.

Ideas, anyone?