Re: [HTCondor-users] condor_status shows nothing
- Date: Tue, 19 Mar 2019 13:47:19 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_status shows nothing
In your MasterLog, this:
03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
is followed a second later by this:
03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.
So no, this is not a problem. When the Master starts up, it starts the SharedPort daemon and then has to wait for the shared_port_ad
file to appear before starting the other daemons.
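If you want to confirm this pattern in a longer MasterLog without reading it by eye, the check can be sketched in Python. This is a minimal sketch, not an HTCondor tool: the function name `shared_port_ad_recovered` is hypothetical, the timestamp format is taken from the log lines quoted above, and the 60-second window is an assumption borrowed from the "Will retry in 60s" message.

```python
from datetime import datetime

# Timestamp layout as it appears in the MasterLog lines quoted above.
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def shared_port_ad_recovered(log_lines, max_gap_seconds=60):
    """Hypothetical helper: return True if every 'failed to open ...
    shared_port_ad' line is followed by a 'Found ... shared_port_ad'
    line within max_gap_seconds (an assumed window)."""
    pending = None  # time of the earliest unresolved failure, if any
    for line in log_lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], LOG_TIME_FORMAT)
        except ValueError:
            continue  # not a timestamped log line
        message = parts[2]
        if "failed to open" in message and "shared_port_ad" in message:
            if pending is None:
                pending = stamp
        elif message.startswith("Found") and "shared_port_ad" in message:
            if pending is not None:
                gap = (stamp - pending).total_seconds()
                if gap <= max_gap_seconds:
                    pending = None  # the file appeared in time
    return pending is None


# The three lines quoted above from the MasterLog.
sample_log = [
    "03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory",
    "03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.",
    "03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.",
]
print(shared_port_ad_recovered(sample_log))
```

A failure that is never followed by a matching "Found" line would make the function return False, which is the case worth investigating.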
Also, condor_status will show nothing if there are no Startds in your pool that are configured to send ads to this collector, or if this
collector is refusing their ads.
Use
condor_status -all
to see all of the ads in the collector, not just Startd ads.
Check the CollectorLog to see if it is refusing to accept any ads.
Use
condor_config_val -dump ALLOW_
to see the configuration that controls which Schedds, Startds, etc. may send ads to this collector. The relevant entries start
with ALLOW_ADVERTISE (ALLOW_DAEMON for some ads, but not for Startd or Schedd ads).
In 8.9.0 we tightened up the default security behavior. In 8.8 you could set ALLOW_WRITE to give permission to send ads to the Collector, because ALLOW_ADVERTISE would inherit from ALLOW_WRITE. This no longer happens in 8.9.0. See the release notes:
http://research.cs.wisc.edu/htcondor/manual/v8.9.0/DevelopmentReleaseSeries89.html
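To pick the relevant entries out of a long dump, a small filter can help. The following is a rough sketch, assuming the dump output consists of `NAME = value` lines with `#` comment lines; the function name `advertise_settings` is hypothetical, and the sample text is made up for illustration, not output from a real pool.

```python
def advertise_settings(dump_text):
    """Hypothetical helper: keep only the entries that govern who may
    send ads to the collector (ALLOW_ADVERTISE* and ALLOW_DAEMON).
    Assumes one 'NAME = value' pair per line."""
    selected = {}
    for raw in dump_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything without a value
        name, _, value = line.partition("=")
        name = name.strip()
        if name.startswith("ALLOW_ADVERTISE") or name == "ALLOW_DAEMON":
            selected[name] = value.strip()
    return selected


# Illustrative sample only; the values are invented for this sketch.
sample_dump = """\
# illustrative sample, not real output
ALLOW_WRITE = 192.168.20.*
ALLOW_ADVERTISE_STARTD = 192.168.20.*
ALLOW_DAEMON = condor@fastpc2
"""

for name, value in sorted(advertise_settings(sample_dump).items()):
    print(name, "=", value)
```

ALLOW_WRITE is deliberately left out of the result, since in 8.9.0 it no longer grants permission to advertise.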
-tj
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Ben Pietras
Sent: Tuesday, March 19, 2019 6:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] condor_status shows nothing
Hi,
I manage two clusters; one is acting a little odd. condor_status returns nothing.
On the master node:
systemctl status condor -l
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-03-18 11:31:47 GMT; 23h ago
Main PID: 16112 (condor_master)
Status: "All daemons are responding"
Tasks: 6 (limit: 32767)
Memory: 24.7M
CGroup: /system.slice/condor.service
├─16112 /usr/sbin/condor_master -f
├─16154 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
├─16155 condor_shared_port -f
├─16157 condor_collector -f
├─16158 condor_negotiator -f
└─16159 condor_schedd -f
Mar 18 11:31:47 fastpc2 systemd[1]: Started Condor Distributed High-Throughput-Computing.
[this looks OK]
tail -50 /var/log/condor/MasterLog
03/18/19 11:00:08 Preen (pid 15523) exited with status 0
03/18/19 11:31:47 Got SIGQUIT. Performing fast shutdown.
03/18/19 11:31:47 Sent SIGQUIT to COLLECTOR (pid 6428)
03/18/19 11:31:47 Sent SIGQUIT to NEGOTIATOR (pid 6432)
03/18/19 11:31:47 Sent SIGQUIT to SCHEDD (pid 6433)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6432, status 0.
03/18/19 11:31:47 The NEGOTIATOR (pid 6432) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6428, status 0.
03/18/19 11:31:47 The COLLECTOR (pid 6428) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6433, status 0.
03/18/19 11:31:47 The SCHEDD (pid 6433) exited with status 0
03/18/19 11:31:47 Sent SIGTERM to SHARED_PORT (pid 6387)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6387, status 0.
03/18/19 11:31:47 The SHARED_PORT (pid 6387) exited with status 0
03/18/19 11:31:47 All daemons are gone. Exiting.
03/18/19 11:31:47 **** condor_master (condor_MASTER) pid 5538 EXITING WITH STATUS 0
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 ** condor_master (CONDOR_MASTER) STARTING UP
03/18/19 11:31:47 ** /usr/sbin/condor_master
03/18/19 11:31:47 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/18/19 11:31:47 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/18/19 11:31:47 ** $CondorVersion: 8.9.0 Feb 27 2019 BuildID: 462330 PackageID: 8.9.0-1 $
03/18/19 11:31:47 ** $CondorPlatform: x86_64_RedHat7 $
03/18/19 11:31:47 ** PID = 16112
03/18/19 11:31:47 ** Log last touched 3/18 11:31:47
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 Using config source: /etc/condor/condor_config
03/18/19 11:31:47 Using local config sources:
03/18/19 11:31:47 /etc/condor/config.d/condor_master_fastpc2.config
03/18/19 11:31:47 /etc/condor/config.d/condor_master_fastpc2.config.bak
03/18/19 11:31:47 /etc/condor/condor_config.local
03/18/19 11:31:47 config Macros = 75, Sorted = 75, StringBytes = 1939, TablesBytes = 2756
03/18/19 11:31:47 CLASSAD_CACHING is OFF
03/18/19 11:31:47 Daemon Log is logging: D_ALWAYS D_ERROR
03/18/19 11:31:48 SharedPortEndpoint: waiting for connections to named socket 16112_1201
03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
03/18/19 11:31:48 DaemonCore: private command socket at <192.168.20.12:0?sock=16112_1201>
03/18/19 11:31:48 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
03/18/19 11:31:48 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1551328706)
03/18/19 11:31:48 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 16155
03/18/19 11:31:48 Waiting for /var/lock/condor/shared_port_ad to appear.
03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.
03/18/19 11:31:49 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 16157
03/18/19 11:31:49 Waiting for /var/log/condor/.collector_address to appear.
03/18/19 11:31:50 Found /var/log/condor/.collector_address.
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 16158
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 16159
03/18/19 12:31:48 Preen pid is 16679
03/18/19 12:31:48 Preen (pid 16679) exited with status 0
[this is from yesterday, "SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory" seems ominous]
ls -ltrh /var/lock/condor/shared_port_ad
-rw-r--r-- 1 condor condor 281 Mar 19 10:56 /var/lock/condor/shared_port_ad
cat /var/lock/condor/shared_port_ad
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<192.168.20.12:9618?addrs=192.168.20.12-9618&noUDP>"
RequestsBlocked = 0
RequestsFailed = 0
RequestsPendingCurrent = 0
RequestsPendingPeak = 2
RequestsSucceeded = 47969
SharedPortCommandSinfuls = "<192.168.20.12:9618>"
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:26:b9:5d:3e:77 brd ff:ff:ff:ff:ff:ff
inet 130.88.20.80/24 brd 130.88.20.255 scope global noprefixroute dynamic em1
valid_lft 1119485sec preferred_lft 1119485sec
inet6 fe80::226:b9ff:fe5d:3e77/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:26:b9:5d:3e:78 brd ff:ff:ff:ff:ff:ff
inet 192.168.20.12/24 brd 192.168.20.255 scope global noprefixroute em2
valid_lft forever preferred_lft forever
inet6 fe80::226:b9ff:fe5d:3e78/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:73:df:89:96 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:73ff:fedf:8996/64 scope link
valid_lft forever preferred_lft forever
6: veth532a0b9@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether ca:e3:89:de:ab:6e brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::c8e3:89ff:fede:ab6e/64 scope link
valid_lft forever preferred_lft forever
cat /etc/*eleas*
NAME="Scientific Linux"
VERSION="7.6 (Nitrogen)"
ID="scientific"
ID_LIKE="rhel centos fedora"
If anyone has any suggestions / wants more info, please let me know.
Best,
Ben
----------------------------------------------------------------------------
Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>
School of Physics and Astronomy, Tel. 0161-275-4231
The University of Manchester, Fax. 0161-275-5509
Manchester, M13 9PL.
----------------------------------------------------------------------------