Dear all,
 
I am having a problem joining two pools via flocking, and I 
suspect it is mainly my assumptions that are wrong.
 
Background
--------------------------
 
Pool A has 10 machines
Pool B has 10 machines
All machines are running WinXP 64-bit on private networks without 
domain controllers.
The cluster heads on pool A and B are connected via a VPN, but none of 
the other nodes of each cluster are connected, nor is IP traffic 
forwarded.
 
I am running these pools in collaboration with someone else and I 
don't have direct access to pool B.
 
To join the two pools together, both masters (cluster heads) have 
BIND_ALL_INTERFACES = true so that they can operate on both their 
internal network interface and the VPN interface.
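For reference, the relevant fragment of each head's condor_config is 
just this one setting (this is the only change we made for this step):

```ini
# condor_config on BOTH cluster heads: listen on every interface,
# i.e. the private LAN adapter and the VPN adapter
BIND_ALL_INTERFACES = True
```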
 
We have also added the name of the opposing pool's cluster head to 
our "hosts" file, e.g. 192.168.1.10 clusterhead_A
 
We have then added that name (not the IP address) to the condor_config 
file in the FLOCK_TO and FLOCK_FROM macros.
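Concretely, it looks like the sketch below; clusterhead_A is the 
opposing head's name from the hosts file, and the opposing head has the 
mirror-image entries naming our head:

```ini
# condor_config on our cluster head: flock to/from the opposing
# pool's head, referred to by its hosts-file name, not its IP
FLOCK_TO   = clusterhead_A
FLOCK_FROM = clusterhead_A
```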
 
Our HOSTALLOW_READ and HOSTALLOW_WRITE are both *, which I know is 
bad, but the clusters are behind firewalls and VPNs and so are only 
accessible by trusted parties. I was hoping to reduce the number of 
hoops flocking had to jump through, and I intend to return to more 
secure settings once it works.
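That is, while debugging we currently have the wide-open settings:

```ini
# condor_config on both heads: temporarily wide open; the pools are
# only reachable over the firewalled VPN, to be tightened later
HOSTALLOW_READ  = *
HOSTALLOW_WRITE = *
```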
 
I can run "condor_q -name clusterhead_A" and see the opposing pool's 
queue, but if I use the IP address, i.e. "condor_q -name 192.168.1.15", 
I get the error message:
"Error: Collector has no record of schedd/submitter"
 
"condor_q -global" also successfully returns the queue from the other 
pool.
 
I have not changed the NO_DNS macro or the DEFAULT_DOMAIN_NAME macro 
in the condor_config file; both are commented out. If I do change 
them and run condor_reconfig, then I get the following error message:
 
ERROR "gethostname failed, errno = 0" at line 266 in file 
..\src\condor_c++_util\my_hostname.C
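For clarity, the change that triggers that error is roughly the 
following (the domain name here is made up, not our real one):

```ini
# uncommenting these and running condor_reconfig produces the
# gethostname error above (domain name is illustrative only)
NO_DNS = True
DEFAULT_DOMAIN_NAME = vpn.example.com
```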
 
------------------------
 
The problem I get is as the subject line reads, and as you can see 
I've tried a few things.
 
What should I do to get Condor flocking working so that jobs migrate 
and run on the other pool, without requiring a direct connection from 
my head node to their execute nodes?
 
I was under the impression that jobs would migrate to the opposing 
pool's queue and then be submitted and managed by the opposing pool 
with the results being passed back. Am I wrong about this?
 
From my log files I can see my cluster head is trying to connect 
directly to the remote cluster's nodes, which it can't do. It also 
seems to have trouble connecting to itself on its VPN IP address, 
even though I have BIND_ALL_INTERFACES = true.
 
If anyone has any ideas/solutions please do reply,
 
Peter
 
PS. I can ping the remote cluster head across the VPN, and also the 
VPN IP address of my own machine.
 
*Dr Peter Myerscough-Jackopson *
Engineer, MAC Ltd
phone: +44 (0) 23 8076 7808  fax: +44 (0) 23 8076 0602
email: peter.myerscough-jackopson@xxxxxxxxxx  web: www.macltd.com
Multiple Access Communications Limited is a company registered in
England at Delta House, Southampton Science Park, Southampton,
SO16 7NS, United Kingdom with Company Number 1979185
 
------------------------------------------------------------------------
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at: 
https://lists.cs.wisc.edu/archive/condor-users/