
[HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?



 

Thanks for the detailed suggestions Todd!

 

It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!

 

The system still doesn't work, though; any further suggestions are appreciated.

 

Here's what I've got (now) in /etc/hosts (fixed!)

 

nathan@physlin6:~$ cat /etc/hosts

199.17.158.2    physlin2.physics.winona.edu     physlin2 # condor central manager

199.17.158.6    physlin6.physics.winona.edu     physlin6 # condor submit

 

199.17.158.11   physlin11.physics.winona.edu    physlin11 # condor exec

199.17.158.12   physlin12.physics.winona.edu    physlin12

199.17.158.13   physlin13.physics.winona.edu    physlin13

199.17.158.14   physlin14.physics.winona.edu    physlin14

199.17.158.15   physlin15.physics.winona.edu    physlin15

199.17.158.16   physlin16.physics.winona.edu    physlin16

199.17.158.17   physlin17.physics.winona.edu    physlin17

199.17.158.18   physlin18.physics.winona.edu    physlin18

 

199.17.158.20   physlin20.physics.winona.edu    physlin20 # condor submit

199.17.158.21   physlin21.physics.winona.edu    physlin21

199.17.158.22   physlin22.physics.winona.edu    physlin22

199.17.158.23   physlin23.physics.winona.edu    physlin23

199.17.158.24   physlin24.physics.winona.edu    physlin24

199.17.158.25   physlin25.physics.winona.edu    physlin25

199.17.158.26   physlin26.physics.winona.edu    physlin26

 

127.0.0.1       localhost

 

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

 

 

looking at files first

 

nathan@physlin6:~$ cat /etc/nsswitch.conf

# /etc/nsswitch.conf

#

# Example configuration of GNU Name Service Switch functionality.

# If you have the `glibc-doc-reference' and `info' packages installed, try:

# `info libc "Name Service Switch"' for information about this file.

 

passwd:         files systemd

group:          files systemd

shadow:         files

gshadow:        files

 

hosts:          files mdns4_minimal [NOTFOUND=return] dns

networks:       files

 

protocols:      db files

services:       db files

ethers:         db files

rpc:            db files

 

netgroup:       nis
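
As a rough cross-check of the `hosts:` line above, here is a minimal Python sketch that lists the lookup sources in the order the resolver would try them. The function name `hosts_sources` is made up for illustration, and the sample text is just the line from this post; bracketed action items such as `[NOTFOUND=return]` are dropped so only the source names remain.

```python
import re

def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts:' line, in order (hypothetical helper)."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # ignore comments
        if line.startswith("hosts:"):
            rest = line.split(":", 1)[1]
            rest = re.sub(r"\[[^\]]*\]", " ", rest)  # drop [STATUS=action] items
            return rest.split()
    return []

sample = "hosts:          files mdns4_minimal [NOTFOUND=return] dns\n"
sources = hosts_sources(sample)
print(sources)
print("files is consulted first:", bool(sources) and sources[0] == "files")
```

With the line shown above, `files` comes first, which matches the "hosts: files dns" pattern suggested in the quoted reply.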

 

domain names seem to resolve

 

nathan@physlin20:~$ nslookup physlin2

Server:         127.0.0.53

Address:        127.0.0.53#53

 

Name:   physlin2

Address: 199.17.158.2

 

reverse lookup seems to work

 

nathan@physlin20:~$ nslookup 199.17.158.2

2.158.17.199.in-addr.arpa       name = physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa       name = physlin2.

 

nathan@physlin20:~$ host 199.17.158.2

2.158.17.199.in-addr.arpa domain name pointer physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa domain name pointer physlin2.
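
One caveat with `nslookup` and `host` is that they talk to a DNS server (here the local stub at 127.0.0.53) rather than going through the system resolver that most programs use. The sketch below is an assumption-laden alternative check: it uses Python's `socket.gethostbyname` and `socket.gethostbyaddr`, which go through the system resolver, to do a forward lookup followed by a reverse lookup and report whether the canonical name that comes back is fully qualified. The helper name `check_round_trip` and its `forward`/`reverse` parameters are invented for illustration.

```python
import socket

def check_round_trip(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> canonical name via the system resolver (hypothetical helper)."""
    ip = forward(name)
    canonical, aliases, _addresses = reverse(ip)
    return {
        "name": name,
        "ip": ip,
        "canonical": canonical,
        "aliases": list(aliases),
        # A dot in the canonical name is used here as a simple stand-in for "is an FQDN".
        "fqdn_first": "." in canonical,
    }
```

Calling `check_round_trip("physlin2")` on each machine (wrapped in `try/except OSError` to catch failed lookups) should show `fqdn_first` as true everywhere if the /etc/hosts ordering is being picked up.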

 

After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available

 

nathan@physlin6:~$ condor_q

 

 

-- Schedd: physlin6.physics.winona.edu : <199.17.158.6:9618?... @ 03/28/23 08:27:22

OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

 

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

 

nathan@physlin6:~$ condor_status

nathan@physlin6:~$

 

It looks like the appropriate ports are open

 

nathan@physlin6:~$ nmap physlin6

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:30 CDT

Nmap scan report for physlin6 (199.17.158.6)

Host is up (0.00014s latency).

rDNS record for 199.17.158.6: physlin6.physics.winona.edu

Not shown: 995 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

80/tcp   open  http

8651/tcp open  unknown

8652/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds

nathan@physlin6:~$ nmap physlin2

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:31 CDT

Nmap scan report for physlin2 (199.17.158.2)

Host is up (0.00039s latency).

rDNS record for 199.17.158.2: physlin2.physics.winona.edu

Not shown: 996 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

111/tcp  open  rpcbind

8649/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

 

the condor service seems to run ok on a submit node

 

nathan@physlin6:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago

   Main PID: 1046 (condor_master)

     Status: "All daemons are responding"

      Tasks: 4 (limit: 4194303)

     Memory: 17.0M

        CPU: 253ms

     CGroup: /system.slice/condor.service

             ├─1046 /usr/sbin/condor_master -f

             ├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1152 condor_shared_port

             └─1153 condor_schedd

 

Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768        60999 to 1024 65535

Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

also ok on central manager

 

nathan@physlin2:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago

   Main PID: 1016 (condor_master)

     Status: "All daemons are responding"

      Tasks: 5 (limit: 4194303)

     Memory: 19.0M

        CPU: 461ms

     CGroup: /system.slice/condor.service

             ├─1016 /usr/sbin/condor_master -f

             ├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1127 condor_shared_port

             ├─1128 condor_collector

             └─1136 condor_negotiator

 

Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).

Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

Not sure if condor is running ok on the execute/compute node, though:

 

nathan@physlin11:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago

   Main PID: 1131 (condor_master)

     Status: "Problems: "

      Tasks: 3 (limit: 4194303)

     Memory: 21.7M

        CPU: 4.584s

     CGroup: /system.slice/condor.service

             ├─1131 /usr/sbin/condor_master -f

             ├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             └─1343 condor_shared_port

 

Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>

Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072
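
Comparing the three process listings above: the central manager shows condor_collector and condor_negotiator, the submit node shows condor_schedd, and the execute node shows only condor_master, condor_procd and condor_shared_port. In a typical pool the execute role is served by condor_startd, so its absence may be worth checking (for example with `condor_config_val DAEMON_LIST` on that node). Below is a small Python sketch of that comparison. The role-to-daemon mapping is an assumption about a basic pool layout, and `missing_daemons` is a hypothetical helper, not an HTCondor tool.

```python
# Daemons usually expected per role in a basic pool (assumed mapping for illustration).
EXPECTED = {
    "central manager": {"condor_collector", "condor_negotiator"},
    "submit": {"condor_schedd"},
    "execute": {"condor_startd"},
}

def missing_daemons(role, status_text):
    """Return expected daemons for `role` that do not appear in pasted `systemctl status` text."""
    running = set()
    for line in status_text.splitlines():
        for token in line.split():
            name = token.rsplit("/", 1)[-1]   # strip any leading path, e.g. /usr/sbin/
            if name.startswith("condor_"):
                running.add(name)
    return sorted(EXPECTED[role] - running)

# CGroup lines copied from the physlin11 listing in this post.
physlin11 = """
1131 /usr/sbin/condor_master -f
1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
1343 condor_shared_port
"""
print(missing_daemons("execute", physlin11))
```

This only reads pasted text; it says nothing about why a daemon is not running.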

 

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

 


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, March 27, 2023 5:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Moore, Nathan T <nmoore@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring Ubuntu22 cluster, missing config step?

 

On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:

I'm configuring condor on a small cluster of linux boxes.  Machines are on an isolated network.  IPs are static and I'm using /etc/hosts instead of dns for hosts to resolve machine name to IP. 

 

It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.

 


Hi Nathan,

Some suggestions:

1. When using host names, HTCondor (like a lot of other internet software) really wants to see the fully qualified domain name (FQDN) as the first name on each line of /etc/hosts; aliases that are not fully qualified can follow.  I.e., the first name on each line should look like host.domain.edu.  For instance, an entry in /etc/hosts should look like this:
  
    199.17.158.6 physlin2.winona.edu physlin2

and not like this:


    199.17.158.6 physlin2 physlin2.winona.edu

2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob.  See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME    Example:

    # echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf

3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
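
The ordering rule in suggestion 1 can be checked mechanically. The following is a minimal Python sketch, assuming that "fully qualified" can be approximated by "contains a dot"; the function name `check_hosts_order` is invented for illustration, and loopback and IPv6 entries are skipped because they normally carry short names.

```python
def check_hosts_order(hosts_text):
    """Return (line_number, ip, first_name) for entries whose first name has no domain part."""
    problems = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        fields = line.split()
        if len(fields) < 2:
            continue
        ip, first_name = fields[0], fields[1]
        if ip.startswith("127.") or ":" in ip:
            continue                          # skip loopback and IPv6 entries
        if "." not in first_name:
            problems.append((lineno, ip, first_name))
    return problems

# Sample text: one entry in the suggested order, one in the order to avoid.
sample = (
    "199.17.158.6    physlin6.physics.winona.edu     physlin6 # condor submit\n"
    "199.17.158.2    physlin2     physlin2.physics.winona.edu\n"
    "127.0.0.1       localhost\n"
)
for lineno, ip, name in check_hosts_order(sample):
    print(f"line {lineno}: {ip} lists short name '{name}' first")
```

To check a real machine, pass the contents of /etc/hosts to `check_hosts_order` instead of the sample string.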

Hope the above helps,
Todd