Re: [HTCondor-users] [EXTERNAL] Collector down and not restarting properly


I apologize – I found the issue. /var is full. Badly mapped disks.
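For anyone who lands here with the same symptoms, a quick way to confirm a full filesystem is to check usage on the directories HTCondor writes to. This is a minimal sketch using only the Python standard library; the directory list comes from the paths in the MasterLog below, and the 95% threshold is an arbitrary assumption, not an HTCondor setting.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold=95.0):
    """True if the filesystem holding `path` is at or above `threshold` percent used.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return percent_used(path) >= threshold


def report(paths, threshold=95.0):
    """Build one human-readable status line per path; unreadable paths are reported, not raised."""
    lines = []
    for path in paths:
        try:
            used = percent_used(path)
        except OSError as exc:
            lines.append(f"{path}: cannot check ({exc.strerror})")
            continue
        flag = "NEARLY FULL" if used >= threshold else "ok"
        lines.append(f"{path}: {used:.1f}% used [{flag}]")
    return lines


if __name__ == "__main__":
    # Paths taken from the MasterLog excerpt in this thread.
    for line in report(["/var", "/var/log/condor", "/var/lock/condor"]):
        print(line)
```

Running it on the affected host prints one line per directory, which makes a full /var obvious before digging into daemon logs.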

Michael Fienen, Ph. D.
Research Hydrologist
United States Geological Survey
Upper Midwest Water Science Center
8505 Research Way
Middleton, WI 53562-3581
phone: 608.821.3894
https://www.usgs.gov/staff-profiles/michael-n-fienen

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fienen, Michael N via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Monday, March 29, 2021 at 5:12 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Fienen, Michael N <mnfienen@xxxxxxxx>
Subject: [EXTERNAL] [HTCondor-users] Collector down and not restarting properly


Hello Condor World! Been a minute….

We have been running rock-solid for months, but just ran into a problem where a user submitted a job and everything stayed in the Idle state. I rebooted the schedd and am now getting errors as it tries to come back up. From MasterLog:

03/28/21 22:14:29 DefaultReaper unexpectedly called on pid 2310192, status 11264.

03/28/21 22:14:29 The COLLECTOR (pid 2310192) exited with status 44

03/28/21 22:14:29 Sending obituary for "/usr/sbin/condor_collector"

03/28/21 22:14:29 restarting /usr/sbin/condor_collector in 10 seconds

03/28/21 22:14:29 condor_write(): Socket closed when trying to write 1513 bytes to collector <schedd_name_here>, fd is 10

03/28/21 22:14:29 Buf::write(): condor_write() failed

03/28/21 22:14:29 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector <schedd_name_here>, in non-blocking mode, errno=104 Connection reset by peer

03/28/21 22:14:29 SECMAN: no classad from server, failing

03/28/21 22:14:29 ERROR: SECMAN:2007:Failed to end classad message.

03/28/21 22:14:29 Failed to start non-blocking update to <schedd_IP_here>:9168,

03/28/21 22:14:33 DefaultReaper unexpectedly cal

03/29/21 14:04:03 ******************************************************

03/29/21 14:04:03 ** condor_master (CONDOR_MASTER) STARTING UP

03/29/21 14:04:03 ** /usr/sbin/condor_master

03/29/21 14:04:03 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)

03/29/21 14:04:03 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON

03/29/21 14:04:03 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $

03/29/21 14:04:03 ** $CondorPlatform: x86_64_CentOS7 $

03/29/21 14:04:03 ** PID = 1344

03/29/21 14:04:03 ** Log last touched 3/29 14:01:34

03/29/21 14:04:03 ******************************************************

03/29/21 14:04:03 Using config source: /etc/condor/condor_config

03/29/21 14:04:03 Using local config sources:

03/29/21 14:04:03 /etc/condor/condor_config.local

03/29/21 14:04:03 config Macros = 70, Sorted = 70, StringBytes = 1827, TablesBytes = 2568

03/29/21 14:04:03 CLASSAD_CACHING is OFF

03/29/21 14:04:03 Daemon Log is logging: D_ALWAYS D_ERROR

03/29/21 14:04:04 SharedPortEndpoint: waiting for connections to named socket 1344_7360

03/29/21 14:04:04 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or director

03/29/21 14:04:04 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.

03/29/21 14:04:04 DaemonCore: private command socket at < schedd_IP_here:0?sock=1344_7360>

03/29/21 14:04:04 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)

03/29/21 14:04:04 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port

03/29/21 14:04:04 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1616514849)

03/29/21 14:04:04 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2077

03/29/21 14:04:04 Waiting for /var/lock/condor/shared_port_ad to appear.

03/29/21 14:04:05 Found /var/lock/condor/shared_port_ad.

03/29/21 14:04:05 DaemonCore: ERROR: Can't open address file /var/log/condor/.master_address.new

03/29/21 14:04:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 2815

03/29/21 14:04:05 Waiting for /var/log/condor/.collector_address to appear.

03/29/21 14:04:06 Waiting for /var/log/condor/.collector_address to appear.

That last line about waiting for .collector_address to appear is now just filling up the MasterLog – it is written once per second.

Seems like a permissions problem somehow, but I don’t see how this could have changed on its own. Any ideas?
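One way to tell a permissions problem apart from other causes of a "Can't open address file" error is to attempt a small write in the directory and look at the errno that comes back. The sketch below is an illustration only: `probe_write` is a hypothetical helper, not part of HTCondor, and it should be run as the same user the daemons run as for the result to be meaningful.

```python
import errno
import os
import tempfile


def probe_write(directory):
    """Try to create and write a small temporary file in `directory`.

    Returns "ok" on success, otherwise the symbolic errno name,
    e.g. "EACCES" (permissions), "ENOSPC" (no space left), "ENOENT" (missing directory).
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            # Force the data to disk so a full filesystem is reported here.
            os.fsync(handle.fileno())
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Directories taken from the MasterLog excerpt above.
    for directory in ("/var/log/condor", "/var/lock/condor"):
        print(directory, "->", probe_write(directory))
```

A result of "EACCES" would point to permissions, while "ENOSPC" would point to a full filesystem.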

Many thanks!

Mike
