Hi all,

I am struggling to spawn a test cluster (v8.6.4 on OpenStack, IPv4 only) where a Master, Collector, Negotiator and Scheduler are running on one host (plus a few workers).

The problem seems to be that the collector cannot connect to itself(?). The master restarts the collector several times but cannot connect to it [1]. In the Collector log, DaemonCore also complains about not being able to connect to the Collector (on the same host/IP).

Network-wise, I was able to communicate via ncat between a worker and the collector host on the shared port 9618 (& 9620). And, as far as I can see, the collector is actually listening on 9618 [5]. (Schedd and Negotiator seem to be happy and are listening to the DaemonCore on port 9620.)

Maybe somebody has an idea what could be jamming the collector? (Being bound also to IPv6 link-local should be no problem, or?)

Cheers and thanks for ideas,
Thomas

btw: is it actually necessary to set POOL_HISTORY_DIR [6] ~~> /var/ViewHist/ ? I had to create the directory manually, but I do not remember that explicitly setting the directory had been necessary before.

[1] > MasterLog
06/30/17 16:01:12 SharedPortEndpoint: waiting for connections to named socket 4932_852f
06/30/17 16:01:12 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
06/30/17 16:01:12 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
06/30/17 16:01:12 DaemonCore: private command socket at <131.169.240.85:0?sock=4932_852f>
06/30/17 16:01:12 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
06/30/17 16:01:12 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1498066869)
06/30/17 16:01:12 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 4978
06/30/17 16:01:12 Waiting for /var/lock/condor/shared_port_ad to appear.
06/30/17 16:01:13 Found /var/lock/condor/shared_port_ad.
06/30/17 16:01:13 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 4979
06/30/17 16:01:13 Waiting for /var/log/condor/.collector_address to appear.
06/30/17 16:01:14 Found /var/log/condor/.collector_address.
06/30/17 16:01:14 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 4980
06/30/17 16:01:14 Started DaemonCore process "/usr/libexec/condor/condor_gangliad", pid and pgroup = 4981
06/30/17 16:01:14 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 4982
06/30/17 16:01:14 Started DaemonCore process "/usr/libexec/condor/condor_defrag", pid and pgroup = 4983
06/30/17 16:01:14 DefaultReaper unexpectedly called on pid 4979, status 1024.
06/30/17 16:01:14 The COLLECTOR (pid 4979) exited with status 4
06/30/17 16:01:14 Sending obituary for "/usr/sbin/condor_collector"
06/30/17 16:01:14 restarting /usr/sbin/condor_collector in 10 seconds
06/30/17 16:01:14 attempt to connect to <131.169.240.85:9618> failed: Connection refused (connect errno = 111).
06/30/17 16:01:14 ERROR: SECMAN:2003:TCP connection to collector os-condor-dev-collector.desy.de:9618 failed.
06/30/17 16:01:14 Failed to start non-blocking update to <131.169.240.85:9618>.
06/30/17 16:01:14 DefaultReaper unexpectedly called on pid 4981, status 1024.
06/30/17 16:01:14 The GANGLIAD (pid 4981) exited with status 4
06/30/17 16:01:14 Sending obituary for "/usr/libexec/condor/condor_gangliad"
06/30/17 16:01:14 restarting /usr/libexec/condor/condor_gangliad in 10 seconds
06/30/17 16:01:14 attempt to connect to <131.169.240.85:9618> failed: Connection refused (connect errno = 111).
06/30/17 16:01:14 ERROR: SECMAN:2003:TCP connection to collector os-condor-dev-collector.desy.de:9618 failed.
06/30/17 16:01:14 Failed to start non-blocking update to <131.169.240.85:9618>.

[2] > CollectorLog w. mkdir /var/ViewHist
06/30/17 16:05:51 MasterAd : Inserting ** "< os-condor-dev-collector.desy.de >"
06/30/17 16:05:51 Query info: matched=0; skipped=0; query_time=0.000982; send_time=0.000148; type=MachinePrivate; requirements={true}; peer=<131.169.240.85:25443>; projection={}
06/30/17 16:05:51 Number of Active Workers 0
06/30/17 16:05:51 creating new table for type Defrag
06/30/17 16:05:51 Defrag: Inserting ** "< os-condor-dev-collector.desy.de >"
06/30/17 16:05:51 (Sending 0 ads in response to query)
06/30/17 16:05:51 Query info: matched=0; skipped=2; query_time=0.001465; send_time=0.000103; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<131.169.240.85:3507>; projection={}
06/30/17 16:05:51 ScheddAd : Inserting ** "< os-condor-dev-collector.desy.de , 131.169.240.85 >"
...
06/30/17 16:05:51 AccountingAd : Inserting ** "< group_OPS >"
06/30/17 16:05:51 AccountingAd : Inserting ** "< group_OTHER >"
06/30/17 16:05:51 DaemonCore: Can't receive command request from 131.169.240.85 (perhaps a timeout?)
06/30/17 16:05:51 NegotiatorAd : Inserting ** "< NEGOTIATOR >"
06/30/17 16:05:54 Got QUERY_STARTD_ADS
06/30/17 16:05:54 Number of Active Workers 0
...
06/30/17 16:11:32 DaemonCore: Can't receive command request from 131.169.240.85 (perhaps a timeout?)

[3] os-condor-dev-batch02 > condor_status
Error: communication error
CONDOR_STATUS:1:Unable to resolve COLLECTOR_HOST (os-condor-dev-collector01.desy.de:9618).

[4] ip addr | grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 131.169.240.85/23 brd 131.169.241.255 scope global dynamic eth0
inet6 fe80::f816:3eff:fe62:d318/64 scope link

[5] > netstat -tlnp | grep 9618
tcp 0 0 0.0.0.0:9618 0.0.0.0:* LISTEN 6653/condor_collect
tcp6 0 0 :::9618 :::* LISTEN 6653/condor_collect

[6] > CollectorLog
06/30/17 16:01:13 Setting maximum file descriptors to 10240.
06/30/17 16:01:13 ******************************************************
06/30/17 16:01:13 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
06/30/17 16:01:13 ** /usr/sbin/condor_collector
06/30/17 16:01:13 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
06/30/17 16:01:13 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
06/30/17 16:01:13 ** $CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
06/30/17 16:01:13 ** $CondorPlatform: x86_64_RedHat7 $
06/30/17 16:01:13 ** PID = 4979
06/30/17 16:01:13 ** Log last touched 6/30 16:00:40
06/30/17 16:01:13 ******************************************************
06/30/17 16:01:13 Using config source: /etc/condor/condor_config
06/30/17 16:01:13 Using local config sources:
06/30/17 16:01:13 /etc/condor/config.d/00masterd.conf
06/30/17 16:01:13 /etc/condor/config.d/04defragd.conf
06/30/17 16:01:13 /etc/condor/config.d/06accounting.conf
06/30/17 16:01:13 /etc/condor/config.d/20rebooter.conf
06/30/17 16:01:13 /etc/condor/condor_config.local
06/30/17 16:01:13 config Macros = 126, Sorted = 126, StringBytes = 6392, TablesBytes = 4616
06/30/17 16:01:13 CLASSAD_CACHING is ENABLED
06/30/17 16:01:13 Daemon Log is logging: D_ALWAYS D_ERROR
06/30/17 16:01:13 SharedPortEndpoint: waiting for connections to named socket 4979_d11b
06/30/17 16:01:13 DaemonCore: non-shared command socket at <131.169.240.85:9618>
06/30/17 16:01:13 Daemoncore: Listening at <0.0.0.0:9618> on TCP (ReliSock) and UDP (SafeSock).
06/30/17 16:01:13 DaemonCore: non-shared command socket at <[::1]:9618>
06/30/17 16:01:13 WARNING: Condor is running on a loopback address
06/30/17 16:01:13 of this machine, and may not visible to other hosts!
06/30/17 16:01:13 Daemoncore: Listening at <[::]:9618> on TCP (ReliSock) and UDP (SafeSock).
06/30/17 16:01:13 DaemonCore: command socket at <131.169.240.85:9620?addrs=131.169.240.85-9620+[--1]-9620&noUDP&sock=4979_d11b>
06/30/17 16:01:13 DaemonCore: private command socket at <131.169.240.85:9620?addrs=131.169.240.85-9620+[--1]-9620&noUDP&sock=4979_d11b>
06/30/17 16:01:14 In ViewServer::Init()
06/30/17 16:01:14 In CollectorDaemon::Init()
06/30/17 16:01:14 In ViewServer::Config()
06/30/17 16:01:14 In CollectorDaemon::Config()
06/30/17 16:01:14 ABSENT_REQUIREMENTS = None
06/30/17 16:01:14 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
06/30/17 16:01:14 enable: Creating stats hash table
06/30/17 16:01:14 Enabling CCB Server.
06/30/17 16:01:14 m_reconnect_fname = /var/lib/condor/spool/131.169.240.85-9620.ccb_reconnect
06/30/17 16:01:14 Configuration: SAMPLING_INTERVAL=60, MAX_STORAGE=10000000, MaxFileSize=333333, POOL_HISTORY_DIR=/var/ViewHist
06/30/17 16:01:14 ERROR "POOL_HISTORY_DIR (/var/ViewHist) does not exist." at line 180 in file /slots/02/dir_4081266/userdir/.tmpO178Wi/BUILD/condor-8.6.4/src/condor_collector.V6/view_server.cpp
06/30/17 16:01:24 Setting maximum file descriptors to 10240.