I think the fundamental problem is a combination of your hosts file and the fact that you seem to be forcing HTCondor to use 127.0.0.1 as the preferred IP address.
We look up tuna and get 127.0.0.1, then we look up 127.0.0.1, and the first name listed for it in the hosts file is localhost, so that becomes the hostname.
I think you need to either remove tuna from the hosts file, give it a different IP address (like the public IP address), or make it the first entry in the hosts file for 127.0.0.1.
But I'm confused about how you can have a 3-node pool that works at all if you are telling HTCondor to use 127.0.0.1 for communication. The nodes should be unable to talk to each other.
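The lookup sequence described above can be sketched in shell. This is only a simplified model of a hosts-file lookup, not HTCondor's actual resolver code: the helper names forward_lookup and reverse_lookup are made up for illustration, the sample file just copies the layout quoted further down in this thread, and comment lines in a real hosts file are not handled.

```shell
# Sample hosts file with the same layout as the one quoted in this thread.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
127.0.0.1 localhost tuna
127.0.1.1 tuna.ocwen.com tuna
EOF

# Hypothetical helper: forward lookup. Prints the address on the first
# line that lists the given name (fields 2..NF are names and aliases).
forward_lookup() {
    awk -v name="$1" '{ for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }' "$2"
}

# Hypothetical helper: reverse lookup. Prints the first name on the
# first line whose address matches.
reverse_lookup() {
    awk -v addr="$1" '$1 == addr { print $2; exit }' "$2"
}

addr=$(forward_lookup tuna "$hosts_file")
name=$(reverse_lookup "$addr" "$hosts_file")
echo "tuna -> $addr -> $name"   # prints: tuna -> 127.0.0.1 -> localhost
rm -f "$hosts_file"
```

On a real machine, running getent hosts tuna and then getent hosts 127.0.0.1 should show what the system resolver returns for the same two steps, though the result depends on how name service lookups are configured on that host.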
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Lyle Pakula <Lyle@xxxxxxxxxxxxxxxx>
Sent: Sunday, August 1, 2021 9:33 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor 9.1
Hi John,
Thanks for the help.
1/ NETWORK_INTERFACE is the same on all machines
lyle@tuna$ condor_config_val -v NETWORK_INTERFACE
NETWORK_INTERFACE = *
# at: <Default>
# raw: NETWORK_INTERFACE = *
FYI, my /etc/hosts on all machines follows a standard layout, e.g. for tuna:
lyle@tuna$ cat /etc/hosts
127.0.0.1 localhost tuna
127.0.1.1 tuna.ocwen.com tuna
All machines have an /etc/hostname file containing their "hostname", but the domain name is blank.
2/ UID_DOMAIN is likewise at its default on all machines:
lyle@grenadier:$ condor_config_val -v UID_DOMAIN
UID_DOMAIN = localhost
# at: <Default>
# raw: UID_DOMAIN = $(FULL_HOSTNAME)
... What I tried
It looks to me as though condor is not picking up the actual hostname, perhaps because we have no domain name configured.
lyle@grenadier:/etc/condor/config.d$ hostname
grenadier
lyle@grenadier:/etc/condor/config.d$ condor_config_val -v HOSTNAME
HOSTNAME = localhost
# at: <Detected>
# raw: HOSTNAME = localhost
lyle@grenadier:/etc/condor/config.d$ condor_config_val -v FULL_HOSTNAME
FULL_HOSTNAME = localhost
# at: <Detected>
# raw: FULL_HOSTNAME = localhost
* I tried pointing NETWORK_INTERFACE to 127.0.1.1 on all machines, and also to the central manager's IP (something I read), but this did not change what condor picks up as the hostname.
* I tried setting UID_DOMAIN=ocwen.com on all machines, but this did not work (everything still runs as nobody). I suspect this is also because the hostname is not picked up correctly.
Thanks, Lyle
I think slots are appearing as localhost because your condor_config is telling condor to use localhost as the primary network interface.
What does the condor_config have set for NETWORK_INTERFACE?
Try running
condor_config_val -v NETWORK_INTERFACE
By the way, you can see all of your configuration that differs from the default HTCondor configuration by running
condor_config_val -summary
When a job runs, files will be written as nobody if the job runs as nobody, which happens when HTCondor does not think that the submit node and the execute node have the same set of user ids. It decides this by comparing the value of UID_DOMAIN on both of these machines.
Try running
condor_config_val -v UID_DOMAIN
on both the submit machine and the execute machine. What is the value on each?
Having files written as nobody on the execute node is not a problem when HTCondor is doing file transfer, because it will change ownership of the files as it transfers the results back. But if you are using a shared file system, you may need to do some additional configuration.
Instructions for setting up HTCondor to use a shared file system are here
-tj
Hi Everyone and thanks for everyone's help in advance!
* Starting with a basic setup (3 Machines, 3 roles) + NAS mounted on all machines.
* Vanilla universe Jobs read/write to and from the NAS
Question 1 - Why are slots appearing as "localhost" and not the name of the machine they are actually on?
lyle@tuna:~$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:39
slot2@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:36
slot3@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:33
slot4@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:32
slot5@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:31
slot6@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:42
slot7@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:41
slot8@localhost LINUX X86_64 Unclaimed Idle 0.000 1990 0+00:30:41
Question 2 - Files are written as nobody:nouser, how can we change this?
The problem here is that the written files cannot be read or written by the submitter.
Tried this but did not work
Thanks, Lyle
--
AE CAPITAL
Ground Floor, 555 Bourke Street, Melbourne Australia 3000
p +61 3 9020 7801
m +61 (0)434 872 054
w
http://www.aecapital.com.au
AE Capital Pty Limited (ACN 153 242 865) is regulated by the Australian Securities & Investments Commission and is a Corporate Authorised Representative of JFM Pty Limited (ACN 125
150 656), holder of an Australian Financial Services Licence (AFSL 314585). AE Capital Pty Limited is a member of the National Futures Association (ID 0498660).
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/