
Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?



Hi Nathan,

On one of the problem execute machines, is there anything obvious/useful in /var/log/condor/StartLog? (Maybe also check the MasterLog or ProcLog.) From the systemctl output, it looks like the startd daemon is not running, but the master, procd, and shared port daemons are able to run.
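One way to act on this is to scan the tail of each of the three logs for lines that look like errors. The sketch below is only an illustration: the function name scan_condor_logs, the 200-line window, and the search pattern are assumptions rather than anything HTCondor provides. The log directory is the one named above.

```shell
# scan_condor_logs [DIR]
# Print recent lines that look like errors from StartLog, MasterLog and ProcLog.
# DIR defaults to /var/log/condor. The grep pattern is a guess at what
# "obvious/useful" lines might contain; adjust it as needed.
scan_condor_logs() {
    dir=${1:-/var/log/condor}
    for name in StartLog MasterLog ProcLog; do
        log=$dir/$name
        if [ ! -r "$log" ]; then
            # A missing or unreadable log is reported rather than skipped silently.
            echo "== $name: not readable (missing, or try sudo)"
            continue
        fi
        echo "== $name"
        tail -n 200 "$log" | grep -iE 'error|fail|denied|exiting' \
            || echo "(no matching lines)"
    done
}

# Usage on an execute machine (commented out so the block is safe to source):
# scan_condor_logs
```

Run it on one of the problem execute machines, e.g. physlin11, and compare against a machine where the daemons are all responding.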

Jason Patton

On Tue, Mar 28, 2023 at 11:20 AM Moore, Nathan T via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Thanks for the detailed suggestions, Todd!

It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!

The system still doesn't work; any further suggestions are appreciated.

Here's what I now have in /etc/hosts (fixed!):

nathan@physlin6:~$ cat /etc/hosts
199.17.158.2   physlin2.physics.winona.edu   physlin2 # condor central manager
199.17.158.6   physlin6.physics.winona.edu   physlin6 # condor submit

199.17.158.11  physlin11.physics.winona.edu  physlin11 # condor exec
199.17.158.12  physlin12.physics.winona.edu  physlin12
199.17.158.13  physlin13.physics.winona.edu  physlin13
199.17.158.14  physlin14.physics.winona.edu  physlin14
199.17.158.15  physlin15.physics.winona.edu  physlin15
199.17.158.16  physlin16.physics.winona.edu  physlin16
199.17.158.17  physlin17.physics.winona.edu  physlin17
199.17.158.18  physlin18.physics.winona.edu  physlin18

199.17.158.20  physlin20.physics.winona.edu  physlin20 # condor submit
199.17.158.21  physlin21.physics.winona.edu  physlin21
199.17.158.22  physlin22.physics.winona.edu  physlin22
199.17.158.23  physlin23.physics.winona.edu  physlin23
199.17.158.24  physlin24.physics.winona.edu  physlin24
199.17.158.25  physlin25.physics.winona.edu  physlin25
199.17.158.26  physlin26.physics.winona.edu  physlin26

127.0.0.1      localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Looking at the files first:

nathan@physlin6:~$ cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         files systemd
group:          files systemd
shadow:         files
gshadow:        files

hosts:          files mdns4_minimal [NOTFOUND=return] dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis
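
The hosts line above lists files ahead of dns, which is what suggestion 3 in the quoted message below asks for. With eighteen machines to check, a small helper could confirm this on each of them. This is a minimal sketch, assuming the standard nsswitch.conf layout; the function name files_before_dns is made up for illustration.

```shell
# files_before_dns [FILE]
# Exit 0 if the "hosts:" line of FILE (default /etc/nsswitch.conf) lists
# "files", and lists it before "dns" when "dns" is present; exit 1 otherwise.
files_before_dns() {
    awk '
        $1 == "hosts:" {
            found = 1
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i   # position of first "files"
                if ($i == "dns"   && !d) d = i   # position of first "dns"
            }
        }
        END {
            ok = (found && f && (!d || f < d))
            exit(ok ? 0 : 1)
        }
    ' "${1:-/etc/nsswitch.conf}"
}

# Usage (commented out so the block is safe to source):
# files_before_dns && echo "hosts: files comes first" || echo "check nsswitch.conf"
```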


Domain names seem to resolve:

nathan@physlin20:~$ nslookup physlin2
Server:         127.0.0.53
Address:        127.0.0.53#53

Name:   physlin2
Address: 199.17.158.2

Reverse lookup seems to work:

nathan@physlin20:~$ nslookup 199.17.158.2
2.158.17.199.in-addr.arpa       name = physlin2.physics.winona.edu.
2.158.17.199.in-addr.arpa       name = physlin2.

nathan@physlin20:~$ host 199.17.158.2
2.158.17.199.in-addr.arpa domain name pointer physlin2.physics.winona.edu.
2.158.17.199.in-addr.arpa domain name pointer physlin2.
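
nslookup and host send their queries to a DNS server (here the local stub at 127.0.0.53) rather than going through /etc/nsswitch.conf, so it may also be worth checking with getent hosts, which does use the nsswitch.conf lookup order. The sketch below does a forward lookup followed by a reverse lookup and prints the chain. The names lookup and resolve_roundtrip are illustrative assumptions; getent is the standard glibc tool.

```shell
# lookup KEY: resolve a host name or an address via NSS (/etc/nsswitch.conf).
lookup() { getent hosts "$1"; }

# resolve_roundtrip NAME
# Forward-resolve NAME, then reverse-resolve the resulting address, and print
# "NAME -> ADDRESS -> FIRST_NAME_RETURNED". Returns 1 if either step fails.
resolve_roundtrip() {
    name=$1
    ip=$(lookup "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$ip" ]; then
        echo "$name: forward lookup failed"
        return 1
    fi
    canon=$(lookup "$ip" | awk 'NR == 1 { print $2 }')
    if [ -z "$canon" ]; then
        echo "$name: reverse lookup failed for $ip"
        return 1
    fi
    echo "$name -> $ip -> $canon"
}

# Usage (commented out so the block is safe to source):
# for h in physlin2 physlin6 physlin11; do resolve_roundtrip "$h"; done
```

If the last field printed is a short name rather than a fully qualified one, that points back at the ordering of names in /etc/hosts.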


After reinstalling condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available:

nathan@physlin6:~$ condor_q


-- Schedd: physlin6.physics.winona.edu : <199.17.158.6:9618?... @ 03/28/23 08:27:22
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

nathan@physlin6:~$ condor_status
nathan@physlin6:~$

It looks like the appropriate ports are open:

nathan@physlin6:~$ nmap physlin6
Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:30 CDT
Nmap scan report for physlin6 (199.17.158.6)
Host is up (0.00014s latency).
Not shown: 995 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
8651/tcp open  unknown
8652/tcp open  unknown
9618/tcp open  condor

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds
nathan@physlin6:~$ nmap physlin2
Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:31 CDT
Nmap scan report for physlin2 (199.17.158.2)
Host is up (0.00039s latency).
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
111/tcp  open  rpcbind
8649/tcp open  unknown
9618/tcp open  condor

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

The condor service seems to run OK on a submit node:

nathan@physlin6:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago
   Main PID: 1046 (condor_master)
     Status: "All daemons are responding"
      Tasks: 4 (limit: 4194303)
     Memory: 17.0M
        CPU: 253ms
     CGroup: /system.slice/condor.service
             ├─1046 /usr/sbin/condor_master -f
             ├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             ├─1152 condor_shared_port
             └─1153 condor_schedd

Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768   60999 to 1024 65535
Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

It is also OK on the central manager:

nathan@physlin2:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago
   Main PID: 1016 (condor_master)
     Status: "All daemons are responding"
      Tasks: 5 (limit: 4194303)
     Memory: 19.0M
        CPU: 461ms
     CGroup: /system.slice/condor.service
             ├─1016 /usr/sbin/condor_master -f
             ├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             ├─1127 condor_shared_port
             ├─1128 condor_collector
             └─1136 condor_negotiator

Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

I'm not sure condor is running OK on the execute/compute node, though:

nathan@physlin11:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago
   Main PID: 1131 (condor_master)
     Status: "Problems: "
      Tasks: 3 (limit: 4194303)
     Memory: 21.7M
        CPU: 4.584s
     CGroup: /system.slice/condor.service
             ├─1131 /usr/sbin/condor_master -f
             ├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             └─1343 condor_shared_port

Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>
Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072
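
Comparing the three process trees by eye, the submit node runs condor_schedd and the central manager runs condor_collector and condor_negotiator, but the execute node shows no condor_startd. The same comparison can be made mechanically against whatever DAEMON_LIST is configured on each machine. This is a rough sketch, not an HTCondor tool: the function name missing_daemons and the rule "daemon X runs as condor_x" are assumptions that hold for the daemons shown in this thread.

```shell
# missing_daemons "EXPECTED" "RUNNING"
# EXPECTED: space-separated daemon names as they appear in DAEMON_LIST
#           (e.g. "MASTER STARTD").
# RUNNING:  space- or newline-separated process names (e.g. "condor_master").
# Prints each expected daemon that has no matching condor_<name> process.
missing_daemons() {
    expected=$1
    running=" $(printf '%s' "$2" | tr '\n' ' ') "
    for d in $expected; do
        proc="condor_$(printf '%s' "$d" | tr '[:upper:]' '[:lower:]')"
        case "$running" in
            *" $proc "*) ;;          # found, nothing to report
            *) echo "$d" ;;          # expected but not running
        esac
    done
}

# Live usage, only attempted where HTCondor is installed:
if command -v condor_config_val >/dev/null 2>&1; then
    # DAEMON_LIST is comma-separated; process names are taken from the
    # first word of each command line with any directory stripped.
    missing_daemons \
        "$(condor_config_val DAEMON_LIST | tr ',' ' ')" \
        "$(ps -e -o args= | awk '{ n = split($1, p, "/"); print p[n] }')"
fi
```

On physlin11 as shown above, this would be expected to print STARTD if DAEMON_LIST there includes it.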


- - - -
Nathan Moore
Professor of Physics
Winona State University


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, March 27, 2023 5:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Moore, Nathan T <nmoore@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring Ubuntu22 cluster, missing config step?

On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:

I'm configuring condor on a small cluster of Linux boxes. The machines are on an isolated network. IPs are static, and I'm using /etc/hosts instead of DNS to resolve machine names to IPs.

It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.


Hi Nathan,

Some suggestions:

1. When using host names, HTCondor (like a lot of other internet software) really wants to see a fully qualified domain name (FQDN) as the first name after the IP address on each line of /etc/hosts; aliases that are not fully qualified can follow. I.e., the first name on each line should have the form host.domain.edu. For instance, an entry in /etc/hosts should look like this:

    199.17.158.6 physlin2.winona.edu physlin2

and not like this:

    199.17.158.6 physlin2 physlin2.winona.edu

2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob. See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME Example:

    # echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf

3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
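
Suggestion 1 can be checked mechanically on each machine. The sketch below flags any /etc/hosts line whose first name after the address contains no dot. It is a minimal illustration under stated assumptions: the function name check_hosts_fqdn is made up, "contains a dot" is used as a stand-in for "fully qualified", and loopback and IPv6 lines are skipped because names like localhost are not expected to be qualified.

```shell
# check_hosts_fqdn [FILE]
# Report lines of FILE (default /etc/hosts) whose first host name is not
# fully qualified. Exits 1 if any such line is found, 0 otherwise.
check_hosts_fqdn() {
    awk '
        { sub(/#.*/, "") }                      # strip trailing comments
        NF < 2 { next }                         # blank or address-only line
        $1 ~ /:/ || $1 ~ /^127\./ { next }      # skip IPv6 and loopback entries
        $2 !~ /\./ {
            print FILENAME ":" FNR ": first name \"" $2 "\" is not fully qualified"
            bad = 1
        }
        END { exit(bad ? 1 : 0) }
    ' "${1:-/etc/hosts}"
}

# Usage (commented out so the block is safe to source):
# check_hosts_fqdn && echo "all first names look fully qualified"
```

A clean result here does not rule out other causes; it only confirms the name ordering described in suggestion 1.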

Hope the above helps,
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/