I apologize, I found the issue. /var is full. Badly mapped disks.

Michael Fienen, Ph.D.

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fienen, Michael N via HTCondor-users <htcondor-users@xxxxxxxxxxx>

Hello Condor World! Been a minute... We have been running rock-solid for months, but we just ran into a problem where a user submitted a job and everything is staying in the Idle state. I rebooted the schedd and am getting errors as it tries to come back. From
MasterLog:

03/28/21 22:14:29 DefaultReaper unexpectedly called on pid 2310192, status 11264.
03/28/21 22:14:29 The COLLECTOR (pid 2310192) exited with status 44
03/28/21 22:14:29 Sending obituary for "/usr/sbin/condor_collector"
03/28/21 22:14:29 restarting /usr/sbin/condor_collector in 10 seconds
03/28/21 22:14:29 condor_write(): Socket closed when trying to write 1513 bytes to collector <schedd_name_here>, fd is 10
03/28/21 22:14:29 Buf::write(): condor_write() failed
03/28/21 22:14:29 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector <schedd_name_here>, in non-blocking mode, errno=104 Connection reset by peer
03/28/21 22:14:29 SECMAN: no classad from server, failing
03/28/21 22:14:29 ERROR: SECMAN:2007:Failed to end classad message.
03/28/21 22:14:29 Failed to start non-blocking update to <schedd_IP_here>:9168,
03/28/21 22:14:33 DefaultReaper unexpectedly cal

03/29/21 14:04:03 ******************************************************
03/29/21 14:04:03 ** condor_master (CONDOR_MASTER) STARTING UP
03/29/21 14:04:03 ** /usr/sbin/condor_master
03/29/21 14:04:03 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/29/21 14:04:03 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/29/21 14:04:03 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $
03/29/21 14:04:03 ** $CondorPlatform: x86_64_CentOS7 $
03/29/21 14:04:03 ** PID = 1344
03/29/21 14:04:03 ** Log last touched 3/29 14:01:34
03/29/21 14:04:03 ******************************************************
03/29/21 14:04:03 Using config source: /etc/condor/condor_config
03/29/21 14:04:03 Using local config sources:
03/29/21 14:04:03    /etc/condor/condor_config.local
03/29/21 14:04:03 config Macros = 70, Sorted = 70, StringBytes = 1827, TablesBytes = 2568
03/29/21 14:04:03 CLASSAD_CACHING is OFF
03/29/21 14:04:03 Daemon Log is logging: D_ALWAYS D_ERROR
03/29/21 14:04:04 SharedPortEndpoint: waiting for connections to named socket 1344_7360
03/29/21 14:04:04 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/29/21 14:04:04 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
03/29/21 14:04:04 DaemonCore: private command socket at <schedd_IP_here:0?sock=1344_7360>
03/29/21 14:04:04 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
03/29/21 14:04:04 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
03/29/21 14:04:04 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1616514849)
03/29/21 14:04:04 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2077
03/29/21 14:04:04 Waiting for /var/lock/condor/shared_port_ad to appear.
03/29/21 14:04:05 Found /var/lock/condor/shared_port_ad.
03/29/21 14:04:05 DaemonCore: ERROR: Can't open address file /var/log/condor/.master_address.new
03/29/21 14:04:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 2815
03/29/21 14:04:05 Waiting for /var/log/condor/.collector_address to appear.
03/29/21 14:04:06 Waiting for /var/log/condor/.collector_address to appear.

That last line about waiting for .collector_address to appear is now just filling up the MasterLog, writing once per second. It seems like a permissions problem somehow, but I don't see how this could have changed on its own. Any ideas?

Mike

Michael Fienen, Ph.D.
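
The "Can't open address file" error and the endless wait for .collector_address look like a permissions problem, but they are also consistent with the follow-up diagnosis that /var is full: a daemon cannot create its address file on a filesystem with no free space. As a rough sketch (not an HTCondor tool), free space under the directories named in the log can be checked as below. The 5% warning threshold and the helper names are assumptions made for illustration only.

```python
import os
import shutil

# Directories taken from the MasterLog excerpt in this thread.
CONDOR_DIRS = ["/var/log/condor", "/var/lock/condor"]


def nearest_existing(path):
    """Walk up the tree until a path that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def free_percent(path):
    """Percent of free space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(nearest_existing(path))
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.free / usage.total


def report(paths, warn_below=5.0):
    """One status line per path; warn_below is an arbitrary threshold."""
    lines = []
    for p in paths:
        pct = free_percent(p)
        flag = "LOW" if pct < warn_below else "ok"
        lines.append(f"{p}: {pct:.1f}% free [{flag}]")
    return lines


if __name__ == "__main__":
    print("\n".join(report(CONDOR_DIRS)))
```

Running `df -h /var` on the submit host gives the same information without any script. The Python version is only convenient if the check needs to run periodically alongside other monitoring.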