Thanks for the suggestion Jason.
First, one of the networking errors was (I think) the omission of the line
127.0.1.1 physlin12 (or whatever the hostname is)
from the /etc/hosts file. This is apparently a required bugfix that I mistakenly deleted from every machine's /etc/hosts when installing/configuring the cluster.
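A quick way to confirm the line is present on each machine is sketched below. The function name `has_loopback_alias` is hypothetical (it is not part of HTCondor or Ubuntu), and the check assumes the entry is written on a single line starting with 127.0.1.1.

```shell
# Hypothetical helper: succeed if the hosts file maps 127.0.1.1 to the given hostname.
has_loopback_alias() {
    # $1: hostname to look for, $2: hosts file (defaults to /etc/hosts)
    awk -v host="$1" '
        $1 == "127.0.1.1" {
            # any name on the 127.0.1.1 line may match
            for (i = 2; i <= NF; i++) if ($i == host) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}
```

On a node this could be run as `has_loopback_alias "$(hostname)" || echo "127.0.1.1 line missing"`.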
After fixing this and re-installing condor with a command of the form
sudo curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="RanchoGordo23" /bin/bash -s -- --no-dry-run --execute physlin2.physics.winona.edu
I now see a new error on execute and submit nodes.
(submit)
nathan@physlin20:~$ cat /var/log/condor/* | grep ERROR:
03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
(execute)
nathan@physlin11:~$ cat /var/log/condor/* | grep ERROR:
03/28/23 12:17:11 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
This is the error that was resolved by fixing /etc/hosts to include "127.0.1.1 whatever_the_hostname_is":
nathan@physlin11:~$ cat /var/log/condor/MasterLog | grep ERROR
03/28/23 08:26:03 ERROR: SECMAN:2003:TCP connection to collector 199.17.158.2 failed.
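Since the same error repeats with different timestamps, it can help to collapse duplicates before comparing nodes. The sketch below assumes each log line starts with a date and a time field, as in the excerpts above; `summarize_errors` is a hypothetical name, not an HTCondor tool.

```shell
# Hypothetical helper: count distinct ERROR messages in HTCondor-style log files.
summarize_errors() {
    # $@: log files to scan
    # The first two space-separated fields (date, time) are dropped so that
    # repeats of the same message collapse into one counted line.
    grep -h 'ERROR' "$@" | cut -d' ' -f3- | sort | uniq -c | sort -rn
}
```

For example: `summarize_errors /var/log/condor/*Log`.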
- - - -
Nathan Moore
Professor of Physics
Winona State University
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jason Patton via HTCondor-users
Sent: Tuesday, March 28, 2023 11:28 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?
Hi Nathan,
On one of the problem execute machines, is there anything obvious/useful in /var/log/condor/StartLog? (Maybe also check the MasterLog or ProcLog.) From the systemctl output, it looks like the startd daemon is not running, but the master, procd, and shared port daemons are able to run.
Thanks for the detailed suggestions Todd!
It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!
The system still doesn't work; any further suggestions are appreciated.
Here's what I've got (now) in /etc/hosts (fixed!):
nathan@physlin6:~$ cat /etc/hosts
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
nathan@physlin6:~$ cat /etc/nsswitch.conf
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
hosts: files mdns4_minimal [NOTFOUND=return] dns
netgroup: nis
domain names seem to resolve
nathan@physlin20:~$ nslookup physlin2
Server:         127.0.0.53
Address:        127.0.0.53#53
reverse lookup seems to work
nathan@physlin20:~$ nslookup 199.17.158.2
2.158.17.199.in-addr.arpa      name = physlin2.
nathan@physlin20:~$ host 199.17.158.2
2.158.17.199.in-addr.arpa domain name pointer physlin2.
After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available:
nathan@physlin6:~$ condor_q
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
nathan@physlin6:~$ condor_status
It looks like the appropriate ports are open
nathan@physlin6:~$ nmap physlin6
Nmap scan report for physlin6 (199.17.158.6)
Host is up (0.00014s latency).
Not shown: 995 closed ports
Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds
nathan@physlin6:~$ nmap physlin2
Nmap scan report for physlin2 (199.17.158.2)
Host is up (0.00039s latency).
Not shown: 996 closed ports
Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds
the condor service seems to run ok on a submit node
nathan@physlin6:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago
Main PID: 1046 (condor_master)
Status: "All daemons are responding"
Tasks: 4 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1046 /usr/sbin/condor_master -f
├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1152 condor_shared_port
Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768 60999 to 1024 65535
Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760
also ok on central manager
nathan@physlin2:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago
Main PID: 1016 (condor_master)
Status: "All daemons are responding"
Tasks: 5 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1016 /usr/sbin/condor_master -f
├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1127 condor_shared_port
Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760
Not sure if condor is running ok on the execute/compute node though?
nathan@physlin11:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago
Main PID: 1131 (condor_master)
Tasks: 3 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1131 /usr/sbin/condor_master -f
├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1343 condor_shared_port
Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>
Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072
- - - -
Nathan Moore
Professor of Physics
Winona State University
On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:
I'm configuring condor on a small cluster of linux boxes. Machines are on an isolated network. IPs are static and I'm using /etc/hosts instead of dns for hosts to resolve machine name to IP.
It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.
Hi Nathan,
Some suggestions:
1. When using host names, HTCondor (like a lot of other internet software) really wants to see fully qualified domain names (FQDN) as the first entry in each line of /etc/hosts; aliases that are not fully qualified can follow. I.e., the first entry on each line should have the form host.domain.edu. For instance, an entry in /etc/hosts should look like this:
199.17.158.6 physlin2.winona.edu physlin2
and not like this:
199.17.158.6 physlin2 physlin2.winona.edu
2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob. See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME Example:
# echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf
3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
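Points 1 and 3 can be checked mechanically. The following is a minimal sketch, not an official HTCondor check; the function names are hypothetical, and "fully qualified" is approximated here as "the first name contains a dot".

```shell
# Hypothetical helper: list hosts-file entries whose first name is not fully qualified.
hosts_short_first() {
    # $1: path to a hosts file (defaults to /etc/hosts)
    awk '
        /^[[:space:]]*(#|$)/ { next }      # skip comments and blank lines
        NF < 2 { next }                    # skip lines with an address but no name
        $1 ~ /^127\./ || $1 ~ /:/ { next } # skip IPv4 loopback and IPv6 entries
        $2 !~ /\./ { print $1, $2 }        # first name has no dot: treated as not an FQDN
    ' "${1:-/etc/hosts}"
}

# Hypothetical helper: succeed if the "hosts:" line lists "files" before "dns".
nsswitch_files_first() {
    # $1: path to an nsswitch.conf file (defaults to /etc/nsswitch.conf)
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files") { found = 1; exit }
                if ($i == "dns") { exit }
            }
        }
        END { exit(found ? 0 : 1) }
    ' "${1:-/etc/nsswitch.conf}"
}
```

Running `hosts_short_first` with no arguments prints any offending address/name pairs from /etc/hosts; no output means every non-loopback entry starts with a dotted name.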
Hope the above helps,
Todd
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/