Re: [HTCondor-users] Help with Master Config on Rocky 9
- Date: Fri, 17 Jan 2025 22:45:43 +0000
- From: Zach McGrew <mcgrewz@xxxxxxx>
- Subject: Re: [HTCondor-users] Help with Master Config on Rocky 9
Your condor_collector keeps crashing and the condor_master restarts it. From the bottom line of your recent CollectorLog:
01/17/25 22:57:00 ERROR "POOL_HISTORY_DIR (/var/ViewHist) does not exist." at line 180 in file /var/lib/condor/execute/slot1/dir_3321389/userdir/build-CL17w9/BUILD/condor-24.3.0/src/condor_collector.V6/view_server.cpp
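A minimal sketch of pulling that final ERROR line out of a daemon log. The helper name `last_error` is hypothetical, and the log path in the usage comment is an assumption based on the /var/log/condor directory that appears in the MasterLog; adjust it to wherever your LOG directory points.

```shell
# Print the last line of a daemon log that contains an ERROR entry.
# "last_error" is a hypothetical helper name, not an HTCondor tool.
last_error() {
    grep 'ERROR' "$1" | tail -n 1
}

# Usage on the central manager (path is an assumption):
#   last_error /var/log/condor/CollectorLog
```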
You're not setting POOL_HISTORY_DIR in the configs you included earlier, and `condor_config_val -dump -verbose POOL_HISTORY_DIR` says that it's unset by default:
# Parameters with names that match POOL_HISTORY_DIR:
POOL_HISTORY_DIR =
# at: <Default>
# expanded:
You will need to create that directory and make it writable by HTCondor so it can write the history there, or set it to be somewhere else that is writable. The manual describes this in the HTCondorView Server section [1].
I haven't played with the View Server before, but I also don't see where you enabled it in your DAEMON_LIST? Seems like there should be a VIEW_SERVER entry there to cause it to start.
But you also don't need a collector running on every node. Normally the nodes are just told where the collector is in the config by setting CONDOR_HOST, and COLLECTOR_HOST is set to CONDOR_HOST.
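For the directory fix mentioned above, a minimal sketch might look like the following. It assumes the daemons run as the `condor` user and that the path should match the POOL_HISTORY_DIR value shown in the CollectorLog (/var/ViewHist). The helper name `make_history_dir` is hypothetical, and the restart command in the comment assumes a systemd service named `condor`.

```shell
# Create a history directory and hand it to the account HTCondor runs as.
# "make_history_dir" is a hypothetical helper name, not an HTCondor tool.
make_history_dir() {
    dir="$1"
    owner="$2"
    mkdir -p "$dir" || return 1     # create the directory (and parents) if missing
    chown "$owner" "$dir" || return 1   # give ownership to the HTCondor account
    chmod 755 "$dir"                # owner can write, everyone else can read
}

# As root on the central manager (user name and service name are assumptions):
#   make_history_dir /var/ViewHist condor:condor
#   systemctl restart condor
```

If you would rather keep the history somewhere else, the same helper can be pointed at any writable location, with POOL_HISTORY_DIR set to match in your config.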
Since you're starting from scratch, it might be worth looking at the default settings for these systems in the get_htcondor script. The commands:
condor_config_val use role:get_htcondor_central_manager
condor_config_val use role:get_htcondor_submit
condor_config_val use role:get_htcondor_execute
Should give you the default config settings for those types of nodes. Note that they're defined in terms of other templates, so you'll want to expand those too to see what's really being set.
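As a rough sketch of chasing those nested templates, the helper below filters template text for lines that pull in another template. It assumes the expanded output contains lines of the form `use CATEGORY:NAME`; the helper name `nested_templates` is hypothetical, and the actual output format may differ between versions.

```shell
# List the templates referenced inside an expanded template.
# Reads template text on stdin and prints whatever follows "use" on each
# matching line. "nested_templates" is a hypothetical helper name.
nested_templates() {
    grep -i -E '^[[:space:]]*use[[:space:]]+[A-Za-z_]+[[:space:]]*:' |
        sed -E 's/^[[:space:]]*[Uu][Ss][Ee][[:space:]]+//'
}

# Usage (assumes condor_config_val is on the PATH):
#   condor_config_val use role:get_htcondor_central_manager | nested_templates
```

Each name it prints can then be passed back to `condor_config_val use` to see what that template sets in turn.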
-Zach
Reference URLs:
1. https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#configuring-the-htcondorview-server
________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Justin Fisher <justin0419@xxxxxxxxx>
Sent: Friday, January 17, 2025 2:03 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Help with Master Config on Rocky 9
Hi Arshad.
Thanks for this. I should be more careful!
Alas, this wasn't the issue. I changed it to read condor_master, but I still get the same SECMAN error.
I've added my updated logs since I'm guessing the typo doesn't help anyone.
--
Kind regards,
Justin Fisher
# condor_status
Error: communication error
SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
MasterLog
01/17/25 22:55:42 ******************************************************
01/17/25 22:55:42 ** condor_master (CONDOR_MASTER) STARTING UP
01/17/25 22:55:42 ** /usr/sbin/condor_master
01/17/25 22:55:42 ** SubsystemInfo: name=MASTER type=MASTER(1) class=DAEMON(1)
01/17/25 22:55:42 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
01/17/25 22:55:42 ** $CondorVersion: 24.3.0 2025-01-03 BuildID: 778135 PackageID: 24.3.0-1 GitSHA: a2290360 $
01/17/25 22:55:42 ** $CondorPlatform: x86_64_AlmaLinux9 $
01/17/25 22:55:42 ** PID = 3066 RealUID = 0
01/17/25 22:55:42 ** Log last touched time unavailable (No such file or directory)
01/17/25 22:55:42 ******************************************************
01/17/25 22:55:42 Using config source: /etc/condor/condor_config
01/17/25 22:55:42 Using local config sources:
01/17/25 22:55:42 /etc/condor/config.d/00-security
01/17/25 22:55:42 /etc/condor/config.d/01-central-manager.config
01/17/25 22:55:42 /etc/condor/config.d/10-stash-plugin.conf
01/17/25 22:55:42 /etc/condor/config.d/11-torannic-central-manager.config
01/17/25 22:55:42 /etc/condor/condor_config.local
01/17/25 22:55:42 config Macros = 84, Sorted = 84, StringBytes = 2488, TablesBytes = 3096
01/17/25 22:55:42 CLASSAD_CACHING is OFF
01/17/25 22:55:42 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
01/17/25 22:55:43 SharedPortEndpoint: waiting for connections to named socket master_3066_9976
01/17/25 22:55:43 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
01/17/25 22:55:43 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
01/17/25 22:55:43 DaemonCore: private command socket at <192.168.0.117:0?alias=edamgr.torannic.com&sock=master_3066_9976>
01/17/25 22:55:43 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
01/17/25 22:55:43 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1735947933)
01/17/25 22:55:43 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 3112
01/17/25 22:55:43 Waiting for /var/lock/condor/shared_port_ad to appear.
01/17/25 22:55:43 Found /var/lock/condor/shared_port_ad.
01/17/25 22:55:43 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 3113
01/17/25 22:55:43 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3114
01/17/25 22:55:43 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 3115
01/17/25 22:55:43 Waiting for /var/log/condor/.collector_address to appear.
01/17/25 22:55:43 Found /var/log/condor/.collector_address.
01/17/25 22:55:43 Daemons::StartAllDaemons all daemons were started
01/17/25 22:55:43 The COLLECTOR (pid 3115) exited with status 4
01/17/25 22:55:44 Sending obituary for "/usr/sbin/condor_collector"
01/17/25 22:55:44 restarting /usr/sbin/condor_collector in 10 seconds
01/17/25 22:55:44 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector edamgr.torannic.com:9618 in non-blocking mode, errno=104 Connection reset by peer
01/17/25 22:55:44 SECMAN: Failed to read resume session response classad from server.
01/17/25 22:55:44 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
01/17/25 22:55:44 Failed to start non-blocking update to <192.168.0.117:9618>.
01/17/25 22:55:54 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 3143
01/17/25 22:55:54 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector edamgr.torannic.com:9618 in non-blocking mode, errno=104 Connection reset by peer
01/17/25 22:55:54 SECMAN: Failed to read resume session response classad from server.
01/17/25 22:55:54 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
01/17/25 22:55:54 Failed to start non-blocking update to <192.168.0.117:9618>.
01/17/25 22:55:54 The COLLECTOR (pid 3143) exited with status 4
01/17/25 22:55:54 Sending obituary for "/usr/sbin/condor_collector"
01/17/25 22:55:54 restarting /usr/sbin/condor_collector in 11 seconds
01/17/25 22:55:54 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector edamgr.torannic.com:9618 in non-blocking mode, errno=104 Connection reset by peer
01/17/25 22:55:54 SECMAN: Failed to read resume session response classad from server.
01/17/25 22:55:54 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
01/17/25 22:55:54 Failed to start non-blocking update to <192.168.0.117:9618>.
01/17/25 22:56:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 3155
01/17/25 22:56:05 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector edamgr.torannic.com:9618 in non-blocking mode, errno=104 Connection reset by peer
01/17/25 22:56:05 SECMAN: Failed to read resume session response classad from server.
01/17/25 22:56:05 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
01/17/25 22:56:05 Failed to start non-blocking update to <192.168.0.117:9618>.
01/17/25 22:56:05 The COLLECTOR (pid 3155) exited with status 4
01/17/25 22:56:05 Sending obituary for "/usr/sbin/condor_collector"
01/17/25 22:56:05 restarting /usr/sbin/condor_collector in 13 seconds
01/17/25 22:56:05 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector edamgr.torannic.com:9618 in non-blocking mode, errno=104 Connection reset by peer
01/17/25 22:56:05 SECMAN: Failed to read resume session response classad from server.
01/17/25 22:56:05 ERROR: SECMAN:2007:Failed to read resume session response classad from server.
01/17/25 22:56:05 Failed to start non-blocking update to <192.168.0.117:9618>.
SharedPortLog
01/17/25 22:55:43 Setting maximum file descriptors to 20000.
01/17/25 22:55:43 ******************************************************
01/17/25 22:55:43 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP
01/17/25 22:55:43 ** /usr/libexec/condor/condor_shared_port
01/17/25 22:55:43 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(10) class=DAEMON(1)
01/17/25 22:55:43 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON
01/17/25 22:55:43 ** $CondorVersion: 24.3.0 2025-01-03 BuildID: 778135 PackageID: 24.3.0-1 GitSHA: a2290360 $
01/17/25 22:55:43 ** $CondorPlatform: x86_64_AlmaLinux9 $
01/17/25 22:55:43 ** PID = 3112 RealUID = 0
01/17/25 22:55:43 ** Log last touched time unavailable (No such file or directory)
01/17/25 22:55:43 ******************************************************
01/17/25 22:55:43 Using config source: /etc/condor/condor_config
01/17/25 22:55:43 Using local config sources:
01/17/25 22:55:43 /etc/condor/config.d/00-security
01/17/25 22:55:43 /etc/condor/config.d/01-central-manager.config
01/17/25 22:55:43 /etc/condor/config.d/10-stash-plugin.conf
01/17/25 22:55:43 /etc/condor/config.d/11-torannic-central-manager.config
01/17/25 22:55:43 /etc/condor/condor_config.local
01/17/25 22:55:43 config Macros = 85, Sorted = 85, StringBytes = 2542, TablesBytes = 3132
01/17/25 22:55:43 CLASSAD_CACHING is ENABLED
01/17/25 22:55:43 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
01/17/25 22:55:43 Daemoncore: Listening at <0.0.0.0:9618> on TCP (ReliSock).
01/17/25 22:55:43 DaemonCore: command socket at <192.168.0.117:9618?addrs=192.168.0.117-9618&alias=edamgr.torannic.com&noUDP>
01/17/25 22:55:43 DaemonCore: private command socket at <192.168.0.117:9618?addrs=192.168.0.117-9618&alias=edamgr.torannic.com>
01/17/25 22:55:43 main_init() called
01/17/25 22:55:43 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<192.168.0.117:9618?addrs=192.168.0.117-9618&alias=edamgr.torannic.com&noUDP>"
RequestsBlocked = 0
RequestsFailed = 0
RequestsPendingCurrent = 0
RequestsPendingPeak = 0
RequestsSucceeded = 0
SharedPortCommandSinfuls = "<192.168.0.117:9618?alias=edamgr.torannic.com>"
01/17/25 22:55:44 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:27921>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:55:44 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:2875>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:55:44 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:25439>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:55:44 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:15611>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:55:48 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:16817>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:55:54 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:33025>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:56:05 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:7037>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:56:18 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:10165>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:56:35 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:32471>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
01/17/25 22:56:44 SharedPortServer: server was busy, failed to connect collector as requested by <192.168.0.117:9913>: primary (<cookie>/collector): Connection refused (111); alt (/var/lock/condor/daemon_sock/collector): Connection refused (111)
CollectorLog
01/17/25 22:57:00 ******************************************************
01/17/25 22:57:00 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
01/17/25 22:57:00 ** /usr/sbin/condor_collector
01/17/25 22:57:00 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
01/17/25 22:57:00 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
01/17/25 22:57:00 ** $CondorVersion: 24.3.0 2025-01-03 BuildID: 778135 PackageID: 24.3.0-1 GitSHA: a2290360 $
01/17/25 22:57:00 ** $CondorPlatform: x86_64_AlmaLinux9 $
01/17/25 22:57:00 ** PID = 3199 RealUID = 0
01/17/25 22:57:00 ** Log last touched 1/17 22:56:35
01/17/25 22:57:00 ******************************************************
01/17/25 22:57:00 Using config source: /etc/condor/condor_config
01/17/25 22:57:00 Using local config sources:
01/17/25 22:57:00 /etc/condor/config.d/00-security
01/17/25 22:57:00 /etc/condor/config.d/01-central-manager.config
01/17/25 22:57:00 /etc/condor/config.d/10-stash-plugin.conf
01/17/25 22:57:00 /etc/condor/config.d/11-torannic-central-manager.config
01/17/25 22:57:00 /etc/condor/condor_config.local
01/17/25 22:57:00 config Macros = 85, Sorted = 85, StringBytes = 2538, TablesBytes = 3132
01/17/25 22:57:00 CLASSAD_CACHING is ENABLED
01/17/25 22:57:00 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
01/17/25 22:57:00 SharedPortEndpoint: waiting for connections to named socket collector
01/17/25 22:57:00 DaemonCore: non-shared command socket at <192.168.0.117:8869?alias=edamgr.torannic.com>
01/17/25 22:57:00 Daemoncore: Listening at <0.0.0.0:8869> on TCP (ReliSock) and UDP (SafeSock).
01/17/25 22:57:00 DaemonCore: command socket at <192.168.0.117:9618?addrs=192.168.0.117-9618&alias=edamgr.torannic.com&noUDP&sock=collector>
01/17/25 22:57:00 DaemonCore: private command socket at <192.168.0.117:9618?addrs=192.168.0.117-9618&alias=edamgr.torannic.com&noUDP&sock=collector>
01/17/25 22:57:00 In ViewServer::Init()
01/17/25 22:57:00 In CollectorDaemon::Init()
01/17/25 22:57:00 In ViewServer::Config()
01/17/25 22:57:00 In CollectorDaemon::Config()
01/17/25 22:57:00 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
01/17/25 22:57:00 ABSENT_REQUIREMENTS = None
01/17/25 22:57:00 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
01/17/25 22:57:00 enable: Creating stats hash table
01/17/25 22:57:00 Enabling CCB Server.
01/17/25 22:57:00 Configuration: SAMPLING_INTERVAL=60, MAX_STORAGE=10000000, MaxFileSize=333333, POOL_HISTORY_DIR=/var/ViewHist
01/17/25 22:57:00 ERROR "POOL_HISTORY_DIR (/var/ViewHist) does not exist." at line 180 in file /var/lib/condor/execute/slot1/dir_3321389/userdir/build-CL17w9/BUILD/condor-24.3.0/src/condor_collector.V6/view_server.cpp
On Fri, Jan 17, 2025 at 10:09 PM Arshad Ahmad via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Justin, I noticed that in your configuration, the MASTER daemon is set to $(SBIN)/condor_amster. This appears to be a typo, and you may want to update it to $(SBIN)/condor_master.
Could you please make this correction and post the error if you're still facing it?
Kind regards,
Arshad Ahmad