Hi, I'm attempting to get flocking working from a dedicated cluster to
a cluster of workstations, with only partial success.
I think that the issue may be related to internal vs. external
networks, and so would like some input from the condor community.
We have a dedicated condor cluster that communicates entirely over an
internal 10.101.x.x network, using an internal network interface on
each node and on the head node. The head node and 10 submit nodes also
have an external interface, through which people log in and submit
jobs. However, the schedd, startd, etc. all talk to each other within
this pool on the internal network.
We would like this cluster to be able to flock to a workstation cluster
that is completely on the external network. The head node of this
workstation cluster can also see the internal network, but its
condor_config has explicitly set the interface to be the external one.
FLOCK_TO and FLOCK_FROM were set on both head nodes, each naming the
explicit external interface of the other head node, and ALLOW_READ and
ALLOW_WRITE were set to allow all machines on the external network to
interact with the head node of the workstation cluster.
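To make the setup concrete, here is a rough sketch of the kind of
settings described above. The hostnames, domain, and IP address are
placeholders I made up for illustration (not our real values), and the
actual files contain more than this:

    # Dedicated cluster head node (hypothetical name: cluster-head.example.org)
    FLOCK_TO   = ws-head.example.org
    FLOCK_FROM = ws-head.example.org

    # Workstation cluster head node (hypothetical name: ws-head.example.org)
    FLOCK_TO   = cluster-head.example.org
    FLOCK_FROM = cluster-head.example.org
    # Placeholder external address; the real one is the head's external IP
    NETWORK_INTERFACE = 192.0.2.10
    # Placeholder pattern standing in for "all machines on the external network"
    ALLOW_READ  = *.example.org
    ALLOW_WRITE = *.example.org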
What I've seen is that matchmaking succeeds between the two head nodes
and the workstation nodes prepare to receive the job, but they then
time out. I've increased the timeout from 2 to 20 minutes without any
improvement.
My question is: can the schedd on the submit nodes, which normally
talks to the other nodes over the internal network and interface,
contact condor on the workstations over the external interface? If
not, what options do we have for getting the clusters to flock?