Hi Zach,

That did it! I'm going to post this exchange on htcondor-users so that if anyone else ever has the problem I did, they might be able to find your solution.

Thanks!

On 5/24/19 3:36 PM, Zach Miller wrote:
Hi William,

You no longer need BIND_ALL_INTERFACES, as the default is now true. But since you have multiple interfaces, it seems HTCondor is selecting the wrong interface to use by default. I would try setting:

  NETWORK_INTERFACE = 10.44.7.84

to see if that solves the problem. (Check the manual; I believe you can also specify the name of the interface (a la "eth0") if you would prefer.)

Cheers,
-zach

On 5/24/19, 12:51 PM, "William Seligman" <seligman@xxxxxxxxxxxx> wrote:

Hi Zach,

Thanks for the offer of help. I've already tried reinstalling 8.8 from scratch and starting from a fresh condor_config file. It worked when the condor master was flying solo; the problems start when I include the slave nodes.

Before I deluge you with debug output, let me describe the network setup and what I think the problem is.

The condor master is a dual-homed host sitting on the demilitarized zone of our network's firewall. The two interfaces are:

  olga.mydomain.org       = 12.34.56.84
  olga-local.mydomain.org = 10.44.7.84
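For reference, one way to see which of these addresses the HTCondor daemons are actually advertising (a sketch, assuming a stock install with the command-line tools on the PATH):

  # what interface, if any, HTCondor is configured to use
  condor_config_val NETWORK_INTERFACE

  # what address each daemon advertises to the pool
  condor_status -schedd -autoformat Name MyAddress
  condor_status -startd -autoformat Name MyAddress

The MyAddress attribute in each ad shows the IP the daemon bound to.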
The idea, which worked in 7.7 (and in other condor pools I've worked with in the past), is that users can log in remotely to any system in our demilitarized zone and submit jobs to batch nodes on our local NAT'ed network.

For this "olga cluster", we use a shared filesystem with a shared condor_config file, so I know both the condor master (olga) and the various batch nodes see the same configuration. In condor_config:

  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
  CONDOR_HOST = olga-local.mydomain.org
  ALLOW_WRITE = 12.34.56.0/24, 10.44.0.0/16
  BIND_ALL_INTERFACES = true

There's more, of course, and I'll send you the full dump if you think we'll need it.

Let's consider a slave node:

  olga00.mydomain.org = 10.44.14.0
  DAEMON_LIST = MASTER, STARTD

When I submit a job on olga that runs on any of the batch nodes, including olga00, it swaps from Idle to Running and back again. The job log says:

  Error from slot1@xxxxxxxxxxxxxxxxxxx: Could not initiate file transfer

When I look at StarterLog.slot1 on olga00, it seems clear what's happening:
  05/24/19 13:43:16 (pid:1313536) DaemonCore: command socket at <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
  05/24/19 13:43:16 (pid:1313536) DaemonCore: private command socket at <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
  05/24/19 13:43:17 (pid:1313536) Communicating with shadow <12.34.56.84:9618?addrs=12.34.56.84-9618&noUDP&sock=2002206_0144_30>
  05/24/19 13:43:17 (pid:1313536) Submitting machine is "olga-local.mydomain.org"

Somehow, though olga00 should only be communicating over the NAT'ed local network, it's trying to communicate with the condor master's shadow process via our public network. Since the local machines can't see the public interface on the dual-homed host, we're stuck. Bear in mind that the result of `hostname` on the condor master is "olga".

I should add that I use a DNS trick to simplify my users' lives when they log in to machines elsewhere on my site. On olga, if I use DNS:

  # host olga
  olga.mydomain.org has address 12.34.56.84

On a machine that's on our local network, the result is different:

  # host olga
  olga.mydomain.org has address 10.44.7.84
  olga.mydomain.org has address 12.34.56.84

This is so the users can either log in remotely or use a laptop attached to our NAT'ed network and still use a command like "ssh olga" without having to think about which network they're on. This has nothing directly to do with the olgaNN slave nodes, since the users can't log in to them; it's just a general operational principle for our site.

Any ideas? Do you need to see full dumps?
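For anyone who finds this thread later: the fix that worked here was Zach's suggestion above. A minimal sketch, assuming the same addresses as in this thread, with the setting placed in the master's condor_config (a host-specific local config file on olga would work too):

  # On the dual-homed condor master (olga), pin HTCondor to the
  # NAT'ed interface that the batch nodes can actually reach:
  NETWORK_INTERFACE = 10.44.7.84

  # Per Zach's note, the manual may also allow an interface name:
  # NETWORK_INTERFACE = eth0

Restarting HTCondor on olga (e.g. with condor_restart) should pick up the new setting, since the daemons need to rebind their sockets.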