Ok, looks like I'm at the simple security setup stage. Appreciate all the suggestions!
Quick Configuration of Security
Note: This method of configuring security is experimental. Many tools and daemons that send administrative commands between machines (e.g. condor_off, condor_drain, or condor_defrag) won't work without further setup. We plan to remove this limitation in future releases.
While pool administrators with complex configurations or application developers may need to understand the full security model described in this chapter, HTCondor strives to make it
easy to enable reasonable security settings for new pools.
When installing a new pool, assuming you are on a trusted network and there are no unprivileged users logged in to the submit hosts:
1. Start HTCondor on your central manager host (containing the condor_collector daemon) first. For a fresh install, this will automatically generate a random key in the file specified by SEC_TOKEN_POOL_SIGNING_KEY_FILE (defaulting to /etc/condor/passwords.d/POOL on Linux and $(RELEASE_DIR)\tokens.sk\POOL on Windows).
2. Install an auto-approval rule on the central manager using condor_token_request_auto_approve. This automatically approves any daemons starting on a specified network for a fixed period of time. For example, to auto-authorize any daemon on the network 192.168.0.0/24 for the next hour (3600 seconds), run the following command from the central manager:
$ condor_token_request_auto_approve -netblock 192.168.0.0/24 -lifetime 3600
3. Within the auto-approval rule's lifetime, start the submit and execute hosts inside the appropriate network. The token requests for the corresponding daemons (the condor_master, condor_startd, and condor_schedd) will be automatically approved and installed into /etc/condor/tokens.d/; this will authorize the daemon to advertise to the collector. By default, auto-generated tokens do not have an expiration.
This quick-configuration requires no configuration changes beyond the default settings. More complex cases, such as those where the network is not trusted, are covered in the Token
Authentication section.
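As a rough way to confirm that the last step worked on a submit or execute host, one can count the files in the token directory. This is only a sketch: the helper name count_tokens is made up for illustration, the directory is the default mentioned above, and reading it generally requires root.

```shell
# count_tokens is a hypothetical helper, not an HTCondor tool.
# It prints the number of regular files in a token directory
# (default: /etc/condor/tokens.d, the location named in the text above).
count_tokens() {
    dir="${1:-/etc/condor/tokens.d}"
    # A missing or unreadable directory simply yields a count of 0.
    find "$dir" -maxdepth 1 -type f 2>/dev/null | wc -l | tr -d ' '
}

# Example (run as root on a submit or execute host):
#   count_tokens
```

A count of 0 after the auto-approval window has passed suggests the token request was never approved. In that case condor_token_request_list, run on the central manager, should show any requests that are still pending.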
- - - -
Nathan Moore
Professor of Physics
Winona State University
From: Moore, Nathan T
Sent: Tuesday, March 28, 2023 12:33 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Jason Patton <jpatton@xxxxxxxxxxx>
Subject: RE: [HTCondor-users] FW: Configuring Ubuntu22 cluster, missing config step?
Thanks for the suggestion Jason.
First, one of the networking errors was (I think) the omission of the line
127.0.1.1 physlin12 (or whatever the hostname is...)
from the /etc/hosts file. This is apparently a required fix that I mistakenly deleted from every machine's /etc/hosts when installing/configuring the cluster.
After fixing this and re-installing condor with a command of the form
sudo curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="RanchoGordo23" /bin/bash -s -- --no-dry-run --execute physlin2.physics.winona.edu
I now see a new error on execute and submit nodes.
(submit)
nathan@physlin20:~$ cat /var/log/condor/* | grep ERROR:
03/28/23 12:17:53 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
(execute)
nathan@physlin11:~$ cat /var/log/condor/* | grep ERROR:
03/28/23 12:17:11 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
This is the error that was resolved by fixing /etc/hosts to include "127.0.1.1 whatever_the_hostname_is"
nathan@physlin11:~$ cat /var/log/condor/MasterLog | grep ERROR
03/28/23 08:26:03 ERROR: SECMAN:2003:TCP connection to collector 199.17.158.2 failed.
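For anyone who wants to check the 127.0.1.1 entry on each node before restarting condor, here is a minimal sketch. The function name has_loopback_alias is my own invention, and it assumes the entry looks like the one quoted above (address first, hostname as one of the following fields).

```shell
# has_loopback_alias is a hypothetical helper, not part of HTCondor.
# It succeeds if the hosts file maps 127.0.1.1 to the given hostname.
# Arguments: hostname, then hosts file path (defaults to /etc/hosts).
has_loopback_alias() {
    name="$1"
    hosts_file="${2:-/etc/hosts}"
    awk -v name="$name" '
        $1 ~ /^#/ { next }                 # skip comment lines
        $1 == "127.0.1.1" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}

# Example:
#   has_loopback_alias "$(hostname)" || echo "127.0.1.1 entry missing"
```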
Hi Nathan,
On one of the problem execute machines, is there anything obvious/useful in /var/log/condor/StartLog? (Maybe also check the MasterLog or ProcLog.) From the systemctl output, it looks like the startd daemon is not running, but the master,
procd, and shared port daemons are able to run.
Thanks for the detailed suggestions Todd!
It's interesting to learn I've been writing /etc/hosts wrong for 20 years. Yikes!
System still doesn't work, any further suggestions appreciated.
Here's what I've got (now) in /etc/hosts (fixed!)
nathan@physlin6:~$ cat /etc/hosts
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
nathan@physlin6:~$ cat /etc/nsswitch.conf
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
hosts: files mdns4_minimal [NOTFOUND=return] dns
netgroup: nis
domain names seem to resolve
nathan@physlin20:~$ nslookup physlin2
Server:         127.0.0.53
Address:        127.0.0.53#53
reverse lookup seems to work
nathan@physlin20:~$ nslookup 199.17.158.2
2.158.17.199.in-addr.arpa      name = physlin2.
nathan@physlin20:~$ host 199.17.158.2
2.158.17.199.in-addr.arpa domain name pointer physlin2.
After re-installing condor on the nodes and restarting, the condor_status command on the submit machines still lists no machines available
nathan@physlin6:~$ condor_q
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for nathan: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
nathan@physlin6:~$ condor_status
It looks like the appropriate ports are open
nathan@physlin6:~$ nmap physlin6
Nmap scan report for physlin6 (199.17.158.6)
Host is up (0.00014s latency).
Not shown: 995 closed ports
Nmap done: 1 IP address (1 host up) scanned in 0.07 seconds
nathan@physlin6:~$ nmap physlin2
Nmap scan report for physlin2 (199.17.158.2)
Host is up (0.00039s latency).
Not shown: 996 closed ports
Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds
the condor service seems to run ok on a submit node
nathan@physlin6:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:25:47 CDT; 6min ago
Main PID: 1046 (condor_master)
Status: "All daemons are responding"
Tasks: 4 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1046 /usr/sbin/condor_master -f
├─1151 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1152 condor_shared_port
Mar 28 08:25:47 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:48 physlin6 htcondor[1076]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:48 physlin6 htcondor[1099]: Changing LOCAL_PORT_RANGE (/proc/sys/net/ipv4/ip_local_port_range) from 32768 60999 to 1024 65535
Mar 28 08:25:48 physlin6 htcondor[1105]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:48 physlin6 htcondor[1121]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760
also ok on central manager
nathan@physlin2:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:26:02 CDT; 7min ago
Main PID: 1016 (condor_master)
Status: "All daemons are responding"
Tasks: 5 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1016 /usr/sbin/condor_master -f
├─1126 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1127 condor_shared_port
Mar 28 08:26:02 physlin2 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:26:02 physlin2 htcondor[1081]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Mar 28 08:26:02 physlin2 htcondor[1095]: Changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max) from 212992 to 10485760
Not sure if condor is running ok on the execute/compute node though?
nathan@physlin11:~$ systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-03-28 08:25:39 CDT; 8min ago
Main PID: 1131 (condor_master)
Tasks: 3 (limit: 4194303)
CGroup: /system.slice/condor.service
├─1131 /usr/sbin/condor_master -f
├─1342 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
└─1343 condor_shared_port
Mar 28 08:25:39 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 28 08:25:39 physlin11 htcondor[1145]: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (9223372036854775807).
Mar 28 08:25:39 physlin11 htcondor[1155]: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (4096).
Mar 28 08:25:39 physlin11 htcondor[1161]: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (250>
Mar 28 08:25:39 physlin11 htcondor[1164]: Changing PIPE_USER_PAGES_SOFT (/proc/sys/fs/pipe-user-pages-soft) from 16384 to 131072
On 3/22/2023 9:15 PM, Moore, Nathan T via HTCondor-users wrote:
I'm configuring condor on a small cluster of linux boxes. Machines are on an isolated network. IPs are static and I'm using /etc/hosts instead of DNS to resolve machine names to IPs.
It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.
Hi Nathan,
Some suggestions:
1. When using host names, HTCondor (like a lot of other internet software) really wants to see fully qualified domain names (FQDN) as the first entry in each line of /etc/hosts; aliases that are not fully qualified can follow. I.e., the first entry on each line should have the form host.domain.edu. For instance, an entry in /etc/hosts should look like this:
199.17.158.6 physlin2.winona.edu physlin2
and not like this:
199.17.158.6 physlin2 physlin2.winona.edu
2. If for some reason you cannot edit/change the /etc/hosts file, or it is still broken, you can set the DEFAULT_DOMAIN_NAME HTCondor config knob. See
https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#DEFAULT_DOMAIN_NAME
Example:
# echo "DEFAULT_DOMAIN_NAME = winona.edu" > /etc/condor/config.d/15-SetDomain.conf
3. Confirm that your /etc/nsswitch.conf file uses files for host lookups, i.e. it should have a line similar to "hosts: files dns"
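The first and third checks in the list above could be scripted roughly as follows. This is a sketch rather than anything HTCondor ships: both function names are hypothetical, "fully qualified" is approximated as "contains a dot", and loopback and IPv6 lines are skipped on the assumption that they do not need a FQDN.

```shell
# Hypothetical helpers, not part of HTCondor.

# Print hosts-file entries whose first name (field 2) has no dot,
# i.e. entries that do not lead with a fully qualified name.
list_unqualified_entries() {
    hosts_file="${1:-/etc/hosts}"
    awk '
        $1 ~ /^#/ || NF < 2 { next }        # comments and blank lines
        $1 ~ /^127\./ || $1 ~ /:/ { next }  # loopback and IPv6 lines
        $2 !~ /\./ { print }                # first name lacks a domain
    ' "$hosts_file"
}

# Succeed if the "hosts:" line of nsswitch.conf lists "files" as a source.
hosts_lookup_uses_files() {
    conf="${1:-/etc/nsswitch.conf}"
    awk '
        $1 == "hosts:" { for (i = 2; i <= NF; i++) if ($i == "files") ok = 1 }
        END { exit(ok ? 0 : 1) }
    ' "$conf"
}

# Example:
#   list_unqualified_entries            # no output means every entry leads with a FQDN
#   hosts_lookup_uses_files || echo "hosts: line does not use files"
```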
Hope the above helps,
Todd