Hi Zach,

That did it! I'm going to post this exchange on htcondor-users so that if anyone else ever has the problem I did, they might be able to find your solution.

Thanks!

On 5/24/19 3:36 PM, Zach Miller wrote:
Hi William,

You no longer need BIND_ALL_INTERFACES, as the default is now true. But since you have multiple interfaces, it seems HTCondor is selecting the wrong interface to use by default. I would try setting:

  NETWORK_INTERFACE = 10.44.7.84

to see if that solves the problem. (Check the manual; I believe you can also specify the name of the interface (a la "eth0") if you would prefer.)

Cheers,
-zach

On 5/24/19, 12:51 PM, "William Seligman" <seligman@xxxxxxxxxxxx> wrote:

Hi Zach,

Thanks for the offer of help. I've already tried reinstalling 8.8 from scratch and starting from a fresh condor_config file. It worked when the condor master was flying solo; the problems start when I include the slave nodes.

Before I deluge you with debug output, let me describe the network setup and what I think the problem is.

The condor master is a dual-homed host sitting on the demilitarized zone of our network's firewall. The two interfaces are:

  olga.mydomain.org       = 12.34.56.84
  olga-local.mydomain.org = 10.44.7.84
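For reference, one way to see which of these addresses the HTCondor daemons are actually advertising (a sketch, assuming a stock install with the command-line tools on the PATH):

  # what interface, if any, HTCondor is configured to use
  condor_config_val NETWORK_INTERFACE

  # what address each daemon advertises to the pool
  condor_status -schedd -autoformat Name MyAddress
  condor_status -startd -autoformat Name MyAddress

The MyAddress attribute in each ad shows the IP the daemon bound to.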
The idea, which worked in 7.7 (and in other condor pools I've worked with in the past), is that users can log in remotely to any system in our demilitarized zone and submit jobs to batch nodes on our local NAT'ed network.

For this "olga cluster", we use a shared filesystem with a shared condor_config file, so I know both the condor master (olga) and the various batch nodes see the same configuration. In condor_config:

  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
  CONDOR_HOST = olga-local.mydomain.org
  ALLOW_WRITE = 12.34.56.0/24, 10.44.0.0/16
  BIND_ALL_INTERFACES = true

There's more, of course, and I'll send you the full dump if you think we'll need it.

Let's consider a slave node:

  olga00.mydomain.org = 10.44.14.0
  DAEMON_LIST = MASTER, STARTD

When I submit a job on olga that runs on any of the batch nodes, including olga00, it swaps from Idle to Running and back again. The job log says:

  Error from slot1@xxxxxxxxxxxxxxxxxxx: Could not initiate file transfer

When I look at StarterLog.slot1 on olga00, it seems clear what's happening:
  05/24/19 13:43:16 (pid:1313536) DaemonCore: command socket at <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
  05/24/19 13:43:16 (pid:1313536) DaemonCore: private command socket at <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
  05/24/19 13:43:17 (pid:1313536) Communicating with shadow <12.34.56.84:9618?addrs=12.34.56.84-9618&noUDP&sock=2002206_0144_30>
  05/24/19 13:43:17 (pid:1313536) Submitting machine is "olga-local.mydomain.org"

Somehow, though olga00 should only be communicating over the NAT'ed local network, it's trying to communicate with the condor master's shadow process via our public network. Since the local machines can't see the public interface on the dual-homed host, we're stuck. Bear in mind that the result of `hostname` on the condor master is "olga".

I should add that I use a DNS trick to simplify my users' lives when they log in to machines elsewhere on my site. On olga, if I use DNS:

  # host olga
  olga.mydomain.org has address 12.34.56.84

On a machine that's on our local network, the result is different:

  # host olga
  olga.mydomain.org has address 10.44.7.84
  olga.mydomain.org has address 12.34.56.84

This is so the users can either log in remotely or use a laptop attached to our NAT'ed network and still use a command like "ssh olga" without having to think about which network they're on. This has nothing directly to do with the olgaNN slave nodes, since the users can't log in to them; it's just a general operational principle for our site.

Any ideas? Do you need to see full dumps?
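For anyone who finds this thread later: the fix that worked here was Zach's suggestion above. A minimal sketch, assuming the same addresses as in this thread, with the setting placed in the master's condor_config (a host-specific local config file on olga would work too):

  # On the dual-homed condor master (olga), pin HTCondor to the
  # NAT'ed interface that the batch nodes can actually reach:
  NETWORK_INTERFACE = 10.44.7.84

  # Per Zach's note, the manual may also allow an interface name:
  # NETWORK_INTERFACE = eth0

Restarting HTCondor on olga (e.g. with condor_restart) should pick up the new setting, since the daemons need to rebind their sockets.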