Thanks for the detailed suggestions, Todd!

It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!

The system still doesn't work, so any further suggestions are appreciated.

Here's what I now have in /etc/hosts (fixed!):

nathan@physlin6:~$ cat /etc/hosts
199.17.158.2   physlin2.physics.winona.edu   physlin2 # condor central manager
199.17.158.6   physlin6.physics.winona.edu   physlin6 # condor submit

199.17.158.11  physlin11.physics.winona.edu  physlin11 # condor exec
199.17.158.12  physlin12.physics.winona.edu  physlin12
199.17.158.13  physlin13.physics.winona.edu  physlin13
199.17.158.14  physlin14.physics.winona.edu  physlin14
199.17.158.15  physlin15.physics.winona.edu  physlin15
199.17.158.16  physlin16.physics.winona.edu  physlin16
199.17.158.17  physlin17.physics.winona.edu  physlin17
199.17.158.18  physlin18.physics.winona.edu  physlin18

199.17.158.20  physlin20.physics.winona.edu  physlin20 # condor submit
199.17.158.21  physlin21.physics.winona.edu  physlin21
199.17.158.22  physlin22.physics.winona.edu  physlin22
199.17.158.23  physlin23.physics.winona.edu  physlin23
199.17.158.24  physlin24.physics.winona.edu  physlin24
199.17.158.25  physlin25.physics.winona.edu  physlin25
199.17.158.26  physlin26.physics.winona.edu  physlin26

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Looking at files first:

nathan@physlin6:~$ cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         files systemd
group:          files systemd
shadow:         files
gshadow:        files

hosts:          files mdns4_minimal [NOTFOUND=return] dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

Domain names seem to resolve:

nathan@physlin20:~$ nslookup physlin2
Server:         127.0.0.53
Address:        127.0.0.53#53

Name:   physlin2
Address: 199.17.158.2

Reverse lookup seems to work:

nathan@physlin20:~$ nslookup 199.17.158.2
2.158.17.199.in-addr.arpa       name = physlin2.physics.winona.edu.
2.158.17.199.in-addr.arpa       name = physlin2.

nathan@physlin20:~$ host 199.17.158.2
2.158.17.199.in-addr.arpa domain name pointer physlin2.physics.winona.edu.
2.158.17.199.in-addr.arpa domain name pointer physlin2.

After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available:

nathan@physlin6:~$ condor_q


-- Schedd: physlin6.physics.winona.edu : <199.17.158.6:9618?... @ 03/28/23 08:27:22
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

nathan@physlin6:~$ condor_status
nathan@physlin6:~$

It looks like the appropriate ports are open:

nathan@physlin6:~$ nmap physlin6
Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:30 CDT
Nmap scan report for physlin6 (199.17.158.6)
Host is up (0.00014s latency).
rDNS record for 199.17.158.6: physlin6.physics.winona.edu
Not shown: 995 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
8651/tcp open  unknown
8652/tcp open  unknown
9618/tcp open  condor

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds
nathan@physlin6:~$ nmap physlin2
Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:31 CDT
Nmap scan report for physlin2 (199.17.158.2)
Host is up (0.00039s latency).
rDNS record for 199.17.158.2: physlin2.physics.winona.edu
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
111/tcp  open  rpcbind
8649/tcp open  unknown
9618/tcp open  condor

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

The condor service seems to run OK on a submit node:

nathan@physlin6:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago
   Main PID: 1046 (condor_master)
     Status: "All daemons are responding"
      Tasks: 4 (limit: 4194303)
     Memory: 17.0M
        CPU: 253ms
     CGroup: /system.slice/condor.service
             ├─1046 /usr/sbin/condor_master -f
             ├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             ├─1152 condor_shared_port
             └─1153 condor_schedd

Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768    60999 to 1024 65535
Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

It's also OK on the central manager:

nathan@physlin2:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago
   Main PID: 1016 (condor_master)
     Status: "All daemons are responding"
      Tasks: 5 (limit: 4194303)
     Memory: 19.0M
        CPU: 461ms
     CGroup: /system.slice/condor.service
             ├─1016 /usr/sbin/condor_master -f
             ├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             ├─1127 condor_shared_port
             ├─1128 condor_collector
             └─1136 condor_negotiator

Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

I'm not sure whether condor is running OK on the execute/compute node, though:

nathan@physlin11:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago
   Main PID: 1131 (condor_master)
     Status: "Problems: "
      Tasks: 3 (limit: 4194303)
     Memory: 21.7M
        CPU: 4.584s
     CGroup: /system.slice/condor.service
             ├─1131 /usr/sbin/condor_master -f
             ├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             └─1343 condor_shared_port

Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>
Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072
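
One way to compare the three status listings above is to pull the condor_* program names out of each one. This is only a sketch: list_condor_daemons and has_condor_daemon are names invented here, and it assumes (as in a standard HTCondor setup) that an execute node is expected to run condor_startd, which does not appear in the physlin11 listing.

```shell
#!/bin/sh
# Sketch only: the function names are made up for this example.

# Read `systemctl status condor` text on stdin and print each
# condor_* program name it mentions, one per line, without duplicates.
list_condor_daemons() {
    grep -o 'condor_[a-z_]*' | sort -u
}

# Succeed if the status text on stdin mentions the daemon named in $1.
has_condor_daemon() {
    list_condor_daemons | grep -qxF "$1"
}
```

On an execute node this might be used as: systemctl status condor | has_condor_daemon condor_startd || echo "no condor_startd running". If it reports nothing running, the daemon list configured for that node and its condor logs would be the next places to look.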

- - - -
Nathan Moore
Professor of Physics
Winona State University


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, March 27, 2023 5:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Moore, Nathan T <nmoore@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring Ubuntu22 cluster, missing config step?

On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:
I'm configuring condor on a small cluster of linux boxes. Machines are on an isolated network. IPs are static and I'm using /etc/hosts instead of dns for hosts to resolve machine name to IP.

It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.

Hi Nathan,
Some suggestions:
1. When using host names, HTCondor (like a lot of other internet software) really wants to see a fully qualified domain name (FQDN) as the first entry on each line of /etc/hosts; aliases that are not fully qualified can follow. I.e., the first entry on each line should have the form host.domain.edu. For instance, an entry in /etc/hosts should look like this:

   199.17.158.6 physlin2.winona.edu physlin2
and not like this:
   199.17.158.6 physlin2 physlin2.winona.edu
2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob. See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME Example:
   # echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf
3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
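
Suggestions 1 and 3 above can be checked mechanically. The following is a minimal sketch, not an HTCondor tool: check_hosts_fqdn and hosts_uses_files_first are names invented for this example, and the FQDN test simply assumes that a fully qualified name contains at least one dot.

```shell
#!/bin/sh
# Sketch only: both function names are made up for this example.

# List non-loopback IPv4 entries whose first name contains no dot,
# i.e. entries that do not start with a fully qualified name.
# $1: hosts-format file (defaults to /etc/hosts)
check_hosts_fqdn() {
    awk '
        { sub(/#.*/, "") }                  # drop comments
        NF < 2 { next }                     # skip blank or address-only lines
        $1 ~ /^127\./ || $1 ~ /:/ { next }  # skip loopback and IPv6 entries
        $2 !~ /\./ { print $1, $2 }         # first name is not an FQDN
    ' "${1:-/etc/hosts}"
}

# Succeed only if the "hosts:" line lists "files" and, when "dns" is
# also listed, lists "files" before it.
# $1: nsswitch.conf-format file (defaults to /etc/nsswitch.conf)
hosts_uses_files_first() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i
                if ($i == "dns" && !d) d = i
            }
        }
        END { if (f && (!d || f < d)) exit 0; exit 1 }
    ' "${1:-/etc/nsswitch.conf}"
}
```

Running check_hosts_fqdn with no argument prints the address and first name of every /etc/hosts entry that would need reordering; no output means every checked line already starts with a dotted name. hosts_uses_files_first returns a non-zero exit status when "files" is missing or comes after "dns".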
Hope the above helps,
Todd