
Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?



Thanks for the suggestion Jason. 

 

First, one of the networking errors was (I think) the omission of the line

127.0.1.1 physlin12         (or whatever the hostname is...)

from the /etc/hosts file.  This is apparently a required bugfix that I mistakenly deleted from every machine's /etc/hosts when installing/configuring the cluster.

 

After fixing this and re-installing condor with a command of the form

 

sudo curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="RanchoGordo23" /bin/bash -s -- --no-dry-run --execute physlin2.physics.winona.edu

 

I now see a new error on execute and submit nodes.

 

(submit)

nathan@physlin20:~$ cat /var/log/condor/* | grep ERROR:

03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS

(execute)

nathan@physlin11:~$ cat /var/log/condor/* | grep ERROR:

03/28/23 12:17:11 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
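
As an aside, the error lines above pack several failures into one pipe-separated string. A minimal Python sketch for pulling out which authentication methods were attempted and failed; the function name `failed_auth_methods` is made up for illustration, and the pattern is based only on the two log lines quoted here, not on any documented HTCondor log format:

```python
import re


def failed_auth_methods(log_line: str) -> list[str]:
    """Return the method names that follow 'Failed to authenticate using'."""
    return re.findall(r"Failed to authenticate using (\w+)", log_line)


# Log line copied from the submit node output above.
line = (
    "03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method"
    "|AUTHENTICATE:1004:Failed to authenticate using FS"
    "|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS"
)

print(failed_auth_methods(line))  # ['FS', 'IDTOKENS']
```

Read this way, both the FS and IDTOKENS methods were tried and both failed, on the submit node and on the execute node.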

 

 

 

This is the error that was resolved by fixing /etc/hosts to include "127.0.1.1    whatever_the_hostname_is"

nathan@physlin11:~$ cat /var/log/condor/MasterLog | grep ERROR

03/28/23 08:26:03 ERROR: SECMAN:2003:TCP connection to collector 199.17.158.2 failed.

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jason Patton via HTCondor-users
Sent: Tuesday, March 28, 2023 11:28 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?

 

Hi Nathan,

 

On one of the problem execute machines, is there anything obvious/useful in /var/log/condor/StartLog? (Maybe also check the MasterLog or ProcLog.) From the systemctl output, it looks like the startd daemon is not running, but the master, procd, and shared port daemons are able to run.

 

Jason Patton

 

On Tue, Mar 28, 2023 at 11:20 AM Moore, Nathan T via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

 

Thanks for the detailed suggestions Todd!

 

It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!

 

System still doesn't work, any further suggestions appreciated.

 

Here's what I've got (now) in /etc/hosts (fixed!)

 

nathan@physlin6:~$ cat /etc/hosts

199.17.158.2    physlin2.physics.winona.edu     physlin2 # condor central manager

199.17.158.6    physlin6.physics.winona.edu     physlin6 # condor submit

 

199.17.158.11   physlin11.physics.winona.edu    physlin11 # condor exec

199.17.158.12   physlin12.physics.winona.edu    physlin12

199.17.158.13   physlin13.physics.winona.edu    physlin13

199.17.158.14   physlin14.physics.winona.edu    physlin14

199.17.158.15   physlin15.physics.winona.edu    physlin15

199.17.158.16   physlin16.physics.winona.edu    physlin16

199.17.158.17   physlin17.physics.winona.edu    physlin17

199.17.158.18   physlin18.physics.winona.edu    physlin18

 

199.17.158.20   physlin20.physics.winona.edu    physlin20 # condor submit

199.17.158.21   physlin21.physics.winona.edu    physlin21

199.17.158.22   physlin22.physics.winona.edu    physlin22

199.17.158.23   physlin23.physics.winona.edu    physlin23

199.17.158.24   physlin24.physics.winona.edu    physlin24

199.17.158.25   physlin25.physics.winona.edu    physlin25

199.17.158.26   physlin26.physics.winona.edu    physlin26

 

127.0.0.1       localhost

 

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

 

 

nsswitch.conf is looking at files first

 

nathan@physlin6:~$ cat /etc/nsswitch.conf

# /etc/nsswitch.conf

#

# Example configuration of GNU Name Service Switch functionality.

# If you have the `glibc-doc-reference' and `info' packages installed, try:

# `info libc "Name Service Switch"' for information about this file.

 

passwd:         files systemd

group:          files systemd

shadow:         files

gshadow:        files

 

hosts:          files mdns4_minimal [NOTFOUND=return] dns

networks:       files

 

protocols:      db files

services:       db files

ethers:         db files

rpc:            db files

 

netgroup:       nis
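
A minimal Python sketch of the same check done mechanically: it reads the `hosts:` line from nsswitch.conf text and reports whether `files` is consulted before `dns`. The function names are hypothetical, and bracketed action items such as `[NOTFOUND=return]` are simply skipped rather than interpreted:

```python
def hosts_sources(nsswitch_text: str) -> list[str]:
    """Return the lookup sources on the 'hosts:' line, in order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            tokens = line.split(":", 1)[1].split()
            # Skip action items like [NOTFOUND=return]; they are not sources.
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def files_before_dns(sources: list[str]) -> bool:
    """True if 'files' is present and comes before 'dns' (when dns is listed)."""
    if "files" not in sources:
        return False
    return "dns" not in sources or sources.index("files") < sources.index("dns")


# The hosts line from the listing above.
sample = "hosts:          files mdns4_minimal [NOTFOUND=return] dns"
sources = hosts_sources(sample)
print(sources, files_before_dns(sources))
```

To run it against a real machine, pass the contents of /etc/nsswitch.conf to `hosts_sources` instead of the sample string.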

 

domain names seem to resolve

 

nathan@physlin20:~$ nslookup physlin2

Server:         127.0.0.53

Address:        127.0.0.53#53

 

Name:   physlin2

Address: 199.17.158.2

 

reverse lookup seems to work

 

nathan@physlin20:~$ nslookup 199.17.158.2

2.158.17.199.in-addr.arpa       name = physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa       name = physlin2.

 

nathan@physlin20:~$ host 199.17.158.2

2.158.17.199.in-addr.arpa domain name pointer physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa domain name pointer physlin2.
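
The forward and reverse lookups above can also be done in one step from Python's standard `socket` module. This is a sketch, not part of the original troubleshooting: `round_trip` and `is_fqdn` are hypothetical helper names, and the stand-in resolvers below only mimic the answers shown in the nslookup output so the example runs without touching the network:

```python
import socket


def round_trip(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an address, then resolve that address back to names."""
    address = forward(name)
    canonical, aliases, _addresses = reverse(address)
    return address, canonical, aliases


def is_fqdn(name: str) -> bool:
    """Crude check: a fully qualified name contains at least one inner dot."""
    return "." in name.rstrip(".")


# Stand-in resolvers that mimic the lookups shown above (assumed values).
def fake_forward(name):
    return "199.17.158.2"


def fake_reverse(address):
    return ("physlin2.physics.winona.edu", ["physlin2"], [address])


address, canonical, aliases = round_trip("physlin2", fake_forward, fake_reverse)
print(address, canonical, is_fqdn(canonical))
```

On a cluster node, calling `round_trip("physlin2")` with the default arguments uses the system resolver, so it follows whatever order nsswitch.conf specifies.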

 

After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available

 

nathan@physlin6:~$ condor_q

 

 

-- Schedd: physlin6.physics.winona.edu : <199.17.158.6:9618?... @ 03/28/23 08:27:22

OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

 

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

 

nathan@physlin6:~$ condor_status

nathan@physlin6:~$

 

It looks like the appropriate ports are open

 

nathan@physlin6:~$ nmap physlin6

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:30 CDT

Nmap scan report for physlin6 (199.17.158.6)

Host is up (0.00014s latency).

Not shown: 995 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

80/tcp   open  http

8651/tcp open  unknown

8652/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds

nathan@physlin6:~$ nmap physlin2

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:31 CDT

Nmap scan report for physlin2 (199.17.158.2)

Host is up (0.00039s latency).

Not shown: 996 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

111/tcp  open  rpcbind

8649/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

 

the condor service seems to run ok on a submit node

 

nathan@physlin6:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago

   Main PID: 1046 (condor_master)

     Status: "All daemons are responding"

      Tasks: 4 (limit: 4194303)

     Memory: 17.0M

        CPU: 253ms

     CGroup: /system.slice/condor.service

             ├─1046 /usr/sbin/condor_master -f

             ├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1152 condor_shared_port

             └─1153 condor_schedd

 

Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768        60999 to 1024 65535

Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

also ok on central manager

 

nathan@physlin2:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago

   Main PID: 1016 (condor_master)

     Status: "All daemons are responding"

      Tasks: 5 (limit: 4194303)

     Memory: 19.0M

        CPU: 461ms

     CGroup: /system.slice/condor.service

             ├─1016 /usr/sbin/condor_master -f

             ├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1127 condor_shared_port

             ├─1128 condor_collector

             └─1136 condor_negotiator

 

Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).

Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

Not sure if condor is running ok on the execute/compute node though?

 

nathan@physlin11:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago

   Main PID: 1131 (condor_master)

     Status: "Problems: "

      Tasks: 3 (limit: 4194303)

     Memory: 21.7M

        CPU: 4.584s

     CGroup: /system.slice/condor.service

             ├─1131 /usr/sbin/condor_master -f

             ├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             └─1343 condor_shared_port

 

Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>

Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072

 

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

 


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, March 27, 2023 5:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Moore, Nathan T <nmoore@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring Ubuntu22 cluster, missing config step?

 

On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:

I'm configuring condor on a small cluster of linux boxes.  Machines are on an isolated network.  IPs are static and I'm using /etc/hosts instead of dns for hosts to resolve machine name to IP.

 

It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.

 


Hi Nathan,

Some suggestions:

1. When using host names, HTCondor (like a lot of other internet software) really wants to see a fully qualified domain name (FQDN) as the first name on each line of /etc/hosts; aliases that are not fully qualified can follow.  I.e. the first name on each line should look like host.domain.edu.  For instance, an entry in /etc/hosts should look like this:
  
    199.17.158.6 physlin2.winona.edu physlin2

and not like this:


    199.17.158.6 physlin2 physlin2.winona.edu

2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob.  See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME    Example:

    # echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf

3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
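
Suggestion 1 can be checked mechanically. The following is a minimal Python sketch, not an HTCondor tool: `non_fqdn_first_entries` is a hypothetical helper that flags /etc/hosts lines whose first name has no dot, and it assumes that loopback (127.x) and IPv6 entries should be left out of the check:

```python
def non_fqdn_first_entries(hosts_text: str) -> list[tuple[str, str]]:
    """Return (address, first_name) pairs where the first name is not an FQDN."""
    problems = []
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, comment, or address with no names
        address, first_name = fields[0], fields[1]
        if address.startswith("127.") or ":" in address:
            continue  # assumption: ignore loopback and IPv6 entries
        if "." not in first_name:
            problems.append((address, first_name))
    return problems


# The good and bad forms from suggestion 1, plus a loopback line.
sample = """\
199.17.158.6 physlin2.winona.edu physlin2
199.17.158.6 physlin2 physlin2.winona.edu
127.0.0.1 localhost
"""

print(non_fqdn_first_entries(sample))  # [('199.17.158.6', 'physlin2')]
```

Only the second line is reported, since its short alias comes before the fully qualified name.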

Hope the above helps,
Todd
