Thanks so much for getting back to me, Brian. I've upped the log levels for the Master and pasted the results of a new start below. I also got an email from the server last night stating that the collector process had crashed:
This is an automated email from the Condor system on machine "vuwunicocondor03.ods.vuw.ac.nz". Do not reply. "/sbin/condor_collector" on "vuwunicocondor03.ods.vuw.ac.nz" was killed because it was no longer responding. Condor will automatically restart this process in 10 seconds.
I have an extraordinarily keen user who sent a job to Condor last night, and he reports that it's been very slow, but some jobs are running. Anyway, here's a snip of the MasterLog. Please let me know if you need any more info, extra testing, or logs.
Many thanks, Craig
--
04/25/20 16:11:00 Result of reading /etc/issue: \S
04/25/20 16:11:00 Result of reading /etc/redhat-release: Red Hat Enterprise Linux Server release 7.7 (Maipo)
04/25/20 16:11:00 Using processor count: 2 processors, 2 CPUs, 0 HTs
04/25/20 16:11:00 Reading condor configuration from '/etc/condor/condor_config'
04/25/20 16:11:00 Enumerating interfaces: lo 127.0.0.1 up
04/25/20 16:11:00 Enumerating interfaces: eno16780032 10.40.18.11 up
04/25/20 16:11:00 ******************************************************
04/25/20 16:11:00 ** condor_master (CONDOR_MASTER) STARTING UP
04/25/20 16:11:00 ** /usr/sbin/condor_master
04/25/20 16:11:00 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/25/20 16:11:00 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/25/20 16:11:00 ** $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
04/25/20 16:11:00 ** $CondorPlatform: x86_64_RedHat7 $
04/25/20 16:11:00 ** PID = 1289047
04/25/20 16:11:00 ** Log last touched 4/25 16:09:41
04/25/20 16:11:00 ******************************************************
04/25/20 16:11:00 Using config source: /etc/condor/condor_config
04/25/20 16:11:00 Using local config sources:
04/25/20 16:11:00 /etc/condor/config.d/00VUWCondor_config.local
04/25/20 16:11:00 /etc/condor/config.d/00VUWCondor_config.local
04/25/20 16:11:00 config Macros = 115, Sorted = 115, StringBytes = 3967, TablesBytes = 4196
04/25/20 16:11:00 CLASSAD_CACHING is OFF
04/25/20 16:11:00 Daemon Log is logging: D_FULLDEBUG D_ALWAYS D_ERROR
04/25/20 16:11:00 Attempting to lock /var/log/condor/InstanceLock.
04/25/20 16:11:00 FileLock object is updating timestamp on: /var/log/condor/InstanceLock
04/25/20 16:11:00 FileLock::obtain(1) - @1587787860.331143 lock on /var/log/condor/InstanceLock now WRITE
04/25/20 16:11:00 Obtained lock on /var/log/condor/InstanceLock.
04/25/20 16:11:00 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:01 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:02 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:03 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:04 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:05 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:06 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:07 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:08 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:09 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:10 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:11 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:12 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:13 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:14 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:15 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:16 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:17 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:18 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:19 Waited too long for kernel parameters to be tuned, hard-killing script.
04/25/20 16:11:19 SharedPortEndpoint: waiting for connections to named socket 1289047_9a26
04/25/20 16:11:19 SharedPortEndpoint: failed to open /var/log/condor/shared_port_ad: No such file or directory
04/25/20 16:11:19 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
04/25/20 16:11:19 DaemonCore: private command socket at <10.40.18.11:0?sock=1289047_9a26>
04/25/20 16:11:19 Setting maximum accepts per cycle 8.
04/25/20 16:11:19 Will use TCP to update collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:11:19 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
04/25/20 16:11:19 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1540925514)
04/25/20 16:11:19 No sockets passed from systemd
04/25/20 16:11:19 Set systemd to be notified once every 400 seconds.
04/25/20 16:11:19 ::RealStart; SHARED_PORT >
04/25/20 16:11:19 Looking for matching Collector on 'vuwunicocondor03.ods.vuw.ac.nz' ...
04/25/20 16:11:19 Matching 'vuwunicocondor03.ods.vuw.ac.nz:9618'
04/25/20 16:11:19 Host name matches collector <10.40.18.11:9618>.
04/25/20 16:11:19 Finished looking for Collectors.
04/25/20 16:11:19 Starting Collector on port 9618
04/25/20 16:11:19 Starting daemon on TCP port 9618
04/25/20 16:11:19 Started DaemonCore process "/usr/libexec/condor/condor_shared_port -f", pid and pgroup = 1289257
04/25/20 16:11:19 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:19 FileLock object is updating timestamp on: /var/log/condor/InstanceLock
04/25/20 16:11:19 DaemonCore: No more children processes to reap.
04/25/20 16:11:19 Getting monitoring info for pid 1289047
04/25/20 16:11:20 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:21 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:22 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:23 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:24 enter Daemons::UpdateCollector
04/25/20 16:11:24 Trying to update collector <10.40.18.11:9618>
04/25/20 16:11:24 Attempting to send update via TCP to collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:11:24 File descriptor limits: max 32767, safe 26214
04/25/20 16:11:24 exit Daemons::UpdateCollector
04/25/20 16:11:24 enter Daemons::CheckForNewExecutable
04/25/20 16:11:24 Time stamp of running /usr/sbin/condor_master: 1540925514
04/25/20 16:11:24 GetTimeStamp returned: 1540925514
04/25/20 16:11:24 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:24 condor_read() failed: recv() 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz returned -1, timeout=20, errno=104 Connection reset by peer.
04/25/20 16:11:24 IO: Failed to read packet header
04/25/20 16:11:24 condor_read(): Socket closed when trying to read 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz in non-blocking mode
04/25/20 16:11:24 IO: EOF reading packet header
04/25/20 16:11:24 condor_read(): Socket closed when trying to read 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz
04/25/20 16:11:24 IO: EOF reading packet header
04/25/20 16:11:24 SECMAN: no classad from server, failing
04/25/20 16:11:24 ERROR: SECMAN:2007:Failed to end classad message.
04/25/20 16:11:24 Failed to start non-blocking update to <10.40.18.11:9618>.
04/25/20 16:11:44 Found /var/log/condor/shared_port_ad.
04/25/20 16:11:44 ::RealStart; COLLECTOR >
04/25/20 16:11:44 Looking for matching Collector on 'vuwunicocondor03.ods.vuw.ac.nz' ...
04/25/20 16:11:44 Matching 'vuwunicocondor03.ods.vuw.ac.nz:9618'
04/25/20 16:11:44 Host name matches collector <10.40.18.11:9618>.
04/25/20 16:11:44 Finished looking for Collectors.
04/25/20 16:11:44 Starting collector with shared port id collector
04/25/20 16:11:44 Starting daemon with shared port id collector
04/25/20 16:11:44 Started DaemonCore process "/sbin/condor_collector -f", pid and pgroup = 1289278
04/25/20 16:11:44 Waiting for /var/log/condor/.collector_address to appear.
04/25/20 16:11:45 Found /var/log/condor/.collector_address.
04/25/20 16:11:45 ::RealStart; NEGOTIATOR >
04/25/20 16:11:45 Started DaemonCore process "/sbin/condor_negotiator -f", pid and pgroup = 1289280
04/25/20 16:11:45 ::RealStart; SCHEDD >
04/25/20 16:11:45 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 1289281
04/25/20 16:15:19 Getting monitoring info for pid 1289047
04/25/20 16:16:24 enter Daemons::UpdateCollector
04/25/20 16:16:24 Trying to update collector <10.40.18.11:9618>
04/25/20 16:16:24 Attempting to send update via TCP to collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:16:24 exit Daemons::UpdateCollector
04/25/20 16:16:24 enter Daemons::CheckForNewExecutable
04/25/20 16:16:24 Time stamp of running /usr/sbin/condor_master: 1540925514
04/25/20 16:16:24 GetTimeStamp returned: 1540925514
04/25/20 16:18:25 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/25/20 16:18:25 Failed to start non-blocking update to <10.40.18.11:9618>.
04/25/20 16:19:19 Getting monitoring info for pid 1289047