Mailing List Archives
[Condor-users] Flocking - remote nodes matching, but not executing
- Date: Thu, 25 Oct 2007 11:01:39 +0100
- From: "Peter Myerscough-Jackopson" <peter.myerscough-jackopson@xxxxxxxxxx>
- Subject: [Condor-users] Flocking - remote nodes matching, but not executing
Dear all,

I am having a problem joining two pools via flocking, and I suspect my assumptions are mainly what is wrong.
Background
--------------------------
Pool A has 10 machines.
Pool B has 10 machines.

All machines run WinXP 64-bit on private networks without domain controllers.

The cluster heads of pools A and B are connected via a VPN, but none of the other nodes of either cluster are connected, nor is IP traffic forwarded between them.
I am running these pools in collaboration with someone else, and I do not have direct access to pool B.

To join the two pools together, both masters (cluster heads) have BIND_ALL_INTERFACES = True so that they can operate on both their internal network interface and the VPN interface.
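For reference, the relevant line on each cluster head looks roughly like this (a sketch; where your condor_config lives depends on the install):

```
## condor_config on each cluster head (sketch)
## Listen on every interface, including the VPN one,
## instead of binding only to the default interface.
BIND_ALL_INTERFACES = True
```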
We have also added the name of the opposing pool's cluster head to each machine's "hosts" file, e.g. 192.168.1.10 clusterhead_A, and then added that name (not the IP address) to the FLOCK_TO and FLOCK_FROM macros in the condor_config file.
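Concretely, on the pool A head the setup is something like the following (the 192.168.1.20 address for pool B's head is illustrative, not our real one):

```
## hosts file on clusterhead_A (illustrative address)
192.168.1.20   clusterhead_B

## condor_config on clusterhead_A
## Our schedd may send jobs to pool B's negotiator ...
FLOCK_TO   = clusterhead_B
## ... and pool B's schedds may flock jobs to us.
FLOCK_FROM = clusterhead_B
```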
Our HOSTALLOW_READ and HOSTALLOW_WRITE are both *, which I know is bad, but the clusters are behind firewalls and VPNs and so are only accessible to trusted parties. I was hoping to reduce the number of hoops flocking had to jump through, and I intend to bring this back to more secure settings afterwards.
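When we do tighten this later, I assume something along these lines would be the shape of it (the subnet pattern and hostname are placeholders, not our real values):

```
## condor_config: restrict read/write access to the local
## subnet plus the opposing cluster head (placeholder values)
HOSTALLOW_READ  = 192.168.1.*, clusterhead_B
HOSTALLOW_WRITE = 192.168.1.*, clusterhead_B
```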
I can run "condor_q -name clusterhead_A" and see the opposing pool's queue, but if I use the IP address instead, i.e. "condor_q -name 192.168.1.15", I get the error message:

"Error: Collector has no record of schedd/submitter"

"condor_q -global" also successfully returns the queue from the other pool.
I have not changed the NO_DNS macro or the DEFAULT_DOMAIN_NAME macro in the condor_config file; both are commented out. If I uncomment them and run condor_reconfig, I get the following error message:

ERROR "gethostname failed, errno = 0" at line 266 in file ..\src\condor_c++_util\my_hostname.C
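For completeness, the experiment that triggered the error was simply uncommenting the two macros, roughly like this (the domain value is a placeholder, since we have no domain controller):

```
## condor_config: the change that produced the
## "gethostname failed" error after condor_reconfig
NO_DNS = True
DEFAULT_DOMAIN_NAME = example.local   ## placeholder domain
```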
------------------------
The problem I get is as the subject line reads, and as you can see I have tried a few things.

What should I do to get Condor flocking working such that jobs migrate and run on the other pool, without requiring a direct connection from my head node to their execute nodes?
I was under the impression that jobs would migrate to the opposing pool's queue and then be submitted and managed by the opposing pool, with the results being passed back. Am I wrong about this?
From my log files I can see that my cluster head is trying to connect directly to the remote cluster's nodes, which it cannot do. It also seems to have trouble connecting to itself on its VPN IP address, even though I have BIND_ALL_INTERFACES = True.
If anyone has any ideas or solutions, please do reply.

Peter

PS. I can ping the remote cluster head across the VPN, and also the VPN IP address of my own machine.
Dr Peter Myerscough-Jackopson
Engineer, MAC Ltd
phone: +44 (0) 23 8076 7808  fax: +44 (0) 23 8076 0602
email: peter.myerscough-jackopson@xxxxxxxxxx  web: www.macltd.com

Multiple Access Communications Limited is a company registered in England at Delta House, Southampton Science Park, Southampton, SO16 7NS, United Kingdom with Company Number 1979185.