
Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?



Ok, looks like I'm at the simple security setup stage.  Appreciate all the suggestions!

 

Quick Configuration of Security

Note: This method of configuring security is experimental. Many tools and daemons that send administrative commands between machines (e.g. condor_off, condor_drain, or condor_defrag) won't work without further setup. We plan to remove this limitation in future releases.

While pool administrators with complex configurations or application developers may need to understand the full security model described in this chapter, HTCondor strives to make it easy to enable reasonable security settings for new pools.

When installing a new pool, assuming you are on a trusted network and there are no unprivileged users logged in to the submit hosts:

1.     Start HTCondor on your central manager host (containing the condor_collector daemon) first. For a fresh install, this will automatically generate a random key in the file specified by SEC_TOKEN_POOL_SIGNING_KEY_FILE (defaulting to /etc/condor/passwords.d/POOL on Linux and $(RELEASE_DIR)\tokens.sk\POOL on Windows).

2.     Install an auto-approval rule on the central manager using condor_token_request_auto_approve. This automatically approves any daemons starting on a specified network for a fixed period of time. For example, to auto-authorize any daemon on the network 192.168.0.0/24 for the next hour (3600 seconds), run the following command from the central manager:

       $ condor_token_request_auto_approve -netblock 192.168.0.0/24 -lifetime 3600

3.     Within the auto-approval rule's lifetime, start the submit and execute hosts inside the appropriate network. The token requests for the corresponding daemons (the condor_master, condor_startd, and condor_schedd) will be automatically approved and installed into /etc/condor/tokens.d/; this will authorize the daemon to advertise to the collector. By default, auto-generated tokens do not have an expiration.

This quick-configuration requires no configuration changes beyond the default settings. More complex cases, such as those where the network is not trusted, are covered in the Token Authentication section.
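One way to confirm that the last step took effect is to check whether a token file has appeared in the token directory on the submit or execute host. The following is only a minimal sketch: has_idtoken is a hypothetical helper, not an HTCondor tool, and it assumes the default /etc/condor/tokens.d directory named in the text above (reading the real directory will probably need root). The demo runs against a temporary directory so it does not touch a live install.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): succeed if the token
# directory contains at least one regular file.
has_idtoken() {
    # $1: token directory; defaults to the path given in the quoted docs
    dir=${1:-/etc/condor/tokens.d}
    [ -d "$dir" ] || return 1
    for f in "$dir"/*; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# Demo on a throwaway directory instead of the real one.
demo=$(mktemp -d)
has_idtoken "$demo" || echo "no token yet"
: > "$demo/example-token"      # stand-in for an auto-approved token file
has_idtoken "$demo" && echo "token present"
rm -rf "$demo"
```

If the check fails on a host after the auto-approval window has closed, the daemons on that host presumably never had their token request approved, and the rule would need to be installed again before restarting them.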

 

 

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

From: Moore, Nathan T
Sent: Tuesday, March 28, 2023 12:33 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: RE: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?

 

Thanks for the suggestion Jason. 

 

First, one of the networking errors was (I think) the omission of the line

127.0.1.1 physlin12         (or whatever the hostname is)

from the /etc/hosts file.  This is apparently a required bugfix that I errantly deleted from every machine's /etc/hosts when installing/configuring the cluster.
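A quick way to check each machine for that line might look like the sketch below. has_loopback_alias is a hypothetical helper name, and the demo works on a temporary file rather than the real /etc/hosts; whether the 127.0.1.1 entry is strictly required is the poster's tentative conclusion, not something this check proves.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the hosts file maps 127.0.1.1 to the given name.
has_loopback_alias() {
    # $1: hosts-format file, $2: hostname to look for
    awk -v name="$2" '
        { sub(/#.*/, "") }                  # ignore comments
        $1 == "127.0.1.1" {
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo on a throwaway file rather than the real /etc/hosts.
sample=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '127.0.1.1 physlin12' > "$sample"
has_loopback_alias "$sample" physlin12 && echo "127.0.1.1 entry present"
rm -f "$sample"
```

On a real node the call would be something like has_loopback_alias /etc/hosts "$(hostname)".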

 

After fixing this and re-installing condor with a command of the form

 

sudo curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="RanchoGordo23" /bin/bash -s -- --no-dry-run --execute physlin2.physics.winona.edu

 

I now see a new error on execute and submit nodes.

 

(submit)

nathan@physlin20:~$ cat /var/log/condor/* | grep ERROR:

03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS

(execute)

nathan@physlin11:~$ cat /var/log/condor/* | grep ERROR:

03/28/23 12:17:11 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
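Both nodes report the same pair of failed methods. To pull just the method names out of a larger log, something like the following sketch could be used. failed_auth_methods is a hypothetical helper that relies only on the message format visible in the lines above, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print each
# authentication method that is reported as having failed.
failed_auth_methods() {
    grep -o 'Failed to authenticate using [A-Z0-9_]*' |
        awk '{ print $NF }' |       # keep only the method name
        sort -u                     # one line per distinct method
}

# Demo using the error line quoted above.
printf '%s\n' '03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS' |
    failed_auth_methods
```

For the line above this prints FS and IDTOKENS, one per line. On a node it could be fed with cat /var/log/condor/* as in the commands shown earlier.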

 

 

 

This is the error that was resolved by fixing /etc/hosts to include "127.0.1.1    whatever_the_hostname_is"

               nathan@physlin11:~$ cat /var/log/condor/MasterLog | grep ERROR

03/28/23 08:26:03 ERROR: SECMAN:2003:TCP connection to collector 199.17.158.2 failed.

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jason Patton via HTCondor-users
Sent: Tuesday, March 28, 2023 11:28 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?

 

Hi Nathan,

 

On one of the problem execute machines, is there anything obvious/useful in /var/log/condor/StartLog? (Maybe also check the MasterLog or ProcLog.) From the systemctl output, it looks like the startd daemon is not running, but the master, procd, and shared port daemons are able to run.

 

Jason Patton

 

On Tue, Mar 28, 2023 at 11:20 AM Moore, Nathan T via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

 

Thanks for the detailed suggestions Todd!

 

It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!

 

System still doesn't work, any further suggestions appreciated.

 

Here's what I've got (now) in /etc/hosts (fixed!)

 

nathan@physlin6:~$ cat /etc/hosts

199.17.158.2    physlin2.physics.winona.edu     physlin2 # condor central manager

199.17.158.6    physlin6.physics.winona.edu     physlin6 # condor submit

 

199.17.158.11   physlin11.physics.winona.edu    physlin11 # condor exec

199.17.158.12   physlin12.physics.winona.edu    physlin12

199.17.158.13   physlin13.physics.winona.edu    physlin13

199.17.158.14   physlin14.physics.winona.edu    physlin14

199.17.158.15   physlin15.physics.winona.edu    physlin15

199.17.158.16   physlin16.physics.winona.edu    physlin16

199.17.158.17   physlin17.physics.winona.edu    physlin17

199.17.158.18   physlin18.physics.winona.edu    physlin18

 

199.17.158.20   physlin20.physics.winona.edu    physlin20 # condor submit

199.17.158.21   physlin21.physics.winona.edu    physlin21

199.17.158.22   physlin22.physics.winona.edu    physlin22

199.17.158.23   physlin23.physics.winona.edu    physlin23

199.17.158.24   physlin24.physics.winona.edu    physlin24

199.17.158.25   physlin25.physics.winona.edu    physlin25

199.17.158.26   physlin26.physics.winona.edu    physlin26

 

127.0.0.1       localhost

 

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

 

 

looking at files first

 

nathan@physlin6:~$ cat /etc/nsswitch.conf

# /etc/nsswitch.conf

#

# Example configuration of GNU Name Service Switch functionality.

# If you have the `glibc-doc-reference' and `info' packages installed, try:

# `info libc "Name Service Switch"' for information about this file.

 

passwd:         files systemd

group:          files systemd

shadow:         files

gshadow:        files

 

hosts:          files mdns4_minimal [NOTFOUND=return] dns

networks:       files

 

protocols:      db files

services:       db files

ethers:         db files

rpc:            db files

 

netgroup:       nis

 

domain names seem to resolve

 

nathan@physlin20:~$ nslookup physlin2

Server:         127.0.0.53

Address:        127.0.0.53#53

 

Name:   physlin2

Address: 199.17.158.2

 

reverse lookup seems to work

 

nathan@physlin20:~$ nslookup 199.17.158.2

2.158.17.199.in-addr.arpa       name = physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa       name = physlin2.

 

nathan@physlin20:~$ host 199.17.158.2

2.158.17.199.in-addr.arpa domain name pointer physlin2.physics.winona.edu.

2.158.17.199.in-addr.arpa domain name pointer physlin2.

 

After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available

 

nathan@physlin6:~$ condor_q

 

 

-- Schedd: physlin6.physics.winona.edu : <199.17.158.6:9618?... @ 03/28/23 08:27:22

OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

 

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

 

nathan@physlin6:~$ condor_status

nathan@physlin6:~$

 

It looks like the appropriate ports are open

 

nathan@physlin6:~$ nmap physlin6

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:30 CDT

Nmap scan report for physlin6 (199.17.158.6)

Host is up (0.00014s latency).

Not shown: 995 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

80/tcp   open  http

8651/tcp open  unknown

8652/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds

nathan@physlin6:~$ nmap physlin2

Starting Nmap 7.80 ( https://nmap.org ) at 2023-03-28 08:31 CDT

Nmap scan report for physlin2 (199.17.158.2)

Host is up (0.00039s latency).

Not shown: 996 closed ports

PORT     STATE SERVICE

22/tcp   open  ssh

111/tcp  open  rpcbind

8649/tcp open  unknown

9618/tcp open  condor

 

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

 

the condor service seems to run ok on a submit node

 

nathan@physlin6:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago

   Main PID: 1046 (condor_master)

     Status: "All daemons are responding"

      Tasks: 4 (limit: 4194303)

     Memory: 17.0M

        CPU: 253ms

     CGroup: /system.slice/condor.service

             ├─1046 /usr/sbin/condor_master -f

             ├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1152 condor_shared_port

             └─1153 condor_schedd

 

Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768        60999 to 1024 65535

Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

also ok on central manager

 

nathan@physlin2:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago

   Main PID: 1016 (condor_master)

     Status: "All daemons are responding"

      Tasks: 5 (limit: 4194303)

     Memory: 19.0M

        CPU: 461ms

     CGroup: /system.slice/condor.service

             ├─1016 /usr/sbin/condor_master -f

             ├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1127 condor_shared_port

             ├─1128 condor_collector

             └─1136 condor_negotiator

 

Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).

Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760

 

Not sure if condor is running ok on the execute/compute node though?

 

nathan@physlin11:~$ systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago

   Main PID: 1131 (condor_master)

     Status: "Problems: "

      Tasks: 3 (limit: 4194303)

     Memory: 21.7M

        CPU: 4.584s

     CGroup: /system.slice/condor.service

             ├─1131 /usr/sbin/condor_master -f

             ├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             └─1343 condor_shared_port

 

Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).

Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).

Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>

Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072

 

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University

 

 


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, March 27, 2023 5:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Moore, Nathan T <nmoore@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Configuring Ubuntu22 cluster, missing config step?

 

On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:

I'm configuring condor on a small cluster of linux boxes.  Machines are on an isolated network.  IPs are static and I'm using /etc/hosts instead of dns for hosts to resolve machine name to IP.

 

It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.

 


Hi Nathan,

Some suggestions:

1. When using host names, HTCondor (like a lot of other internet software) really wants to see fully qualified domain names (FQDN) as the first name on each line of /etc/hosts; aliases that are not fully qualified can follow.  I.e., the first name on each line should look like host.domain.edu.  For instance, an entry in /etc/hosts should look like this:
  
    199.17.158.6 physlin2.winona.edu physlin2

and not like this:


    199.17.158.6 physlin2 physlin2.winona.edu
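For a file with many entries, the ordering rule in suggestion 1 can be checked mechanically. The sketch below is one possible approach: check_hosts_fqdn_first is a hypothetical helper (not an HTCondor tool) that treats a name as fully qualified if it contains a dot, and it skips loopback and IPv6 entries as an assumption about what matters here. The demo uses a temporary file with made-up contents in the style of the hosts file from this thread.

```shell
#!/bin/sh
# Hypothetical helper: report hosts-file entries whose first name after
# the IP address is not fully qualified (i.e. contains no dot).
check_hosts_fqdn_first() {
    # $1: path to a hosts-format file, e.g. /etc/hosts
    awk '
        { sub(/#.*/, "") }                          # drop trailing comments
        NF < 2 { next }                             # skip blank/comment-only lines
        $1 ~ /^127\./ || index($1, ":") { next }    # skip loopback and IPv6 entries
        $2 !~ /\./ { print "line " NR ": " $2 " is not fully qualified" }
    ' "$1"
}

# Demo on a throwaway file rather than the real /etc/hosts.
sample=$(mktemp)
printf '%s\n' \
    '199.17.158.2 physlin2.physics.winona.edu physlin2' \
    '199.17.158.6 physlin6 physlin6.physics.winona.edu' \
    '127.0.0.1 localhost' > "$sample"
check_hosts_fqdn_first "$sample"
rm -f "$sample"
```

The demo flags only the second entry, where the short name comes before the FQDN. No output means every checked line starts with a dotted name.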

2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob.  See https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME    Example:

    # echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf

3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
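Suggestion 3 can also be scripted. This is a minimal sketch under the assumption that "uses files for host lookups" means the hosts: line lists files before dns; hosts_lookup_uses_files is a hypothetical helper name, and the demo reads a temporary file rather than the real /etc/nsswitch.conf.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the "hosts:" line lists "files",
# and lists it before "dns".
hosts_lookup_uses_files() {
    # $1: path to an nsswitch.conf-format file
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files") { found = 1; break }
                if ($i == "dns") break      # dns is consulted before files
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo on a throwaway file with a typical Ubuntu hosts: line.
sample=$(mktemp)
printf '%s\n' 'hosts:          files mdns4_minimal [NOTFOUND=return] dns' > "$sample"
hosts_lookup_uses_files "$sample" && echo "hosts lookups consult files first"
rm -f "$sample"
```

On a node the call would be hosts_lookup_uses_files /etc/nsswitch.conf.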

Hope the above helps,
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/