Hi all, hope you are all well during these turbulent times.
I've had a weird problem with my HTCondor instance here at VUW over the past couple of days. Jobs were intermittently being left "idle" for long periods of time for no reason I could discern, and I *think* I may have discovered and resolved an issue affecting a number of the client machines, where their SharedPort addresses were being set to 127.0.0.1.
I thought that it was all sorted, but returning to the server for some final testing has ruined my day a little. I'm currently unable to restart the Condor service on the server, and looking at the MasterLog it seems that the machine isn't able to determine its own communication addresses, if that makes sense. Below is a snippet of the MasterLog during Condor startup; the long time it takes for the 'shared_port_ad' file to appear looks suspicious to me.
Honestly I'm a little lost with this though, and would really appreciate any kind of assistance at all, even if it's just to say I'm barking up the wrong tree. If you need any more info or logs etc, please let me know and I'll get them to you.
04/24/20 15:34:59 ** condor_master (CONDOR_MASTER) STARTING UP
04/24/20 15:34:59 ** /usr/sbin/condor_master
04/24/20 15:34:59 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/24/20 15:34:59 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/24/20 15:34:59 ** $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
04/24/20 15:34:59 ** $CondorPlatform: x86_64_RedHat7 $
04/24/20 15:34:59 ** PID = 1034787
04/24/20 15:34:59 ** Log last touched 4/24 15:32:07
04/24/20 15:34:59 ******************************************************
04/24/20 15:34:59 Using config source: /etc/condor/condor_config
04/24/20 15:34:59 Using local config sources:
04/24/20 15:34:59 /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59 /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59 config Macros = 114, Sorted = 114, StringBytes = 3955, TablesBytes = 4160
04/24/20 15:34:59 CLASSAD_CACHING is OFF
04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR
04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd
04/24/20 15:35:18 SharedPortEndpoint: failed to open /var/log/condor/shared_port_ad: No such file or directory
04/24/20 15:35:18 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
04/24/20 15:35:18 DaemonCore: private command socket at <10.40.18.11:0?sock=1034787_63fd>
04/24/20 15:35:18 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
04/24/20 15:35:18 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1540925514)
04/24/20 15:35:18 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1034852
04/24/20 15:35:18 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:19 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:20 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:21 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:22 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 condor_read() failed: recv() 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz returned -1, timeout=20, errno=104 Connection reset by peer.
04/24/20 15:35:23 IO: Failed to read packet header
04/24/20 15:35:23 SECMAN: no classad from server, failing
04/24/20 15:35:23 ERROR: SECMAN:2007:Failed to end classad message.
04/24/20 15:35:23 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:35:43 Found /var/log/condor/shared_port_ad.
04/24/20 15:35:43 Started DaemonCore process "/sbin/condor_collector", pid and pgroup = 1034877
04/24/20 15:35:43 Waiting for /var/log/condor/.collector_address to appear.
04/24/20 15:35:44 Found /var/log/condor/.collector_address.
04/24/20 15:35:44 Started DaemonCore process "/sbin/condor_negotiator", pid and pgroup = 1034879
04/24/20 15:35:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1034880
04/24/20 15:42:24 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:47:24 Failed to start non-blocking update to <10.40.18.11:9618>.
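
In case it helps anyone looking at this, here is a minimal Python sketch for measuring how long the master waited for each address file. It relies only on the timestamp and message formats visible in the excerpt above. The function name `wait_durations` and the `SAMPLE` list are my own, for illustration only, and are not part of HTCondor; the sketch assumes every relevant log line starts with an MM/DD/YY HH:MM:SS stamp.

```python
import re
from datetime import datetime

# Matches the "MM/DD/YY HH:MM:SS <message>" layout seen in the MasterLog excerpt.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
WAITING = re.compile(r"^Waiting for (\S+) to appear\.$")
FOUND = re.compile(r"^Found (\S+)\.$")


def wait_durations(lines):
    """Return {path: seconds} from the first 'Waiting for <path>' to 'Found <path>'."""
    first_wait = {}  # path -> time of the first "Waiting for ..." message
    waits = {}       # path -> seconds until the matching "Found ..." message
    for line in lines:
        stamped = STAMP.match(line.strip())
        if not stamped:
            continue  # skip lines without a timestamp (e.g. wrapped continuations)
        when = datetime.strptime(stamped.group(1), "%m/%d/%y %H:%M:%S")
        message = stamped.group(2)
        waiting = WAITING.match(message)
        found = FOUND.match(message)
        if waiting:
            # Keep only the earliest "Waiting" time for each path.
            first_wait.setdefault(waiting.group(1), when)
        elif found and found.group(1) in first_wait:
            path = found.group(1)
            waits[path] = (when - first_wait[path]).total_seconds()
    return waits


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "04/24/20 15:35:18 Waiting for /var/log/condor/shared_port_ad to appear.",
    "04/24/20 15:35:23 Waiting for /var/log/condor/shared_port_ad to appear.",
    "04/24/20 15:35:43 Found /var/log/condor/shared_port_ad.",
    "04/24/20 15:35:43 Waiting for /var/log/condor/.collector_address to appear.",
    "04/24/20 15:35:44 Found /var/log/condor/.collector_address.",
]

if __name__ == "__main__":
    for path, seconds in wait_durations(SAMPLE).items():
        print(f"{path}: {seconds:.0f}s")
```

On the lines above this reports 25 seconds for shared_port_ad and 1 second for .collector_address, which matches what I see by eye. To run it against the full log, pass the open file object instead of `SAMPLE`.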