Re: [HTCondor-users] InfiniBand
- Date: Wed, 24 Feb 2016 14:12:00 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] InfiniBand
On 2/24/2016 12:37 PM, Jonathan Knudson wrote:
> I have been tasked with setting up a new server cluster. There is one
> head node and 12 compute nodes. This system is connected via
> InfiniBand. I read in the documentation that IB is useful for parallel
> jobs. Can I utilize this network with the Vanilla Universe?
The main advantage of IB is low latency, which is helpful for parallel
jobs that pass many small messages between compute nodes (i.e., MPI
jobs). Many sites will only want MPI traffic on their IB network, and
will purposefully direct all non-MPI traffic (HTCondor traffic, NFS
traffic, ssh/scp traffic -- anything that is not super sensitive to
latency) to ethernet so as not to degrade the performance of their MPI
jobs.
The HTCondor config knobs will not control the pathway for traffic
coming from your jobs themselves; they only control traffic originating
from the HTCondor daemons, such as file transfer performed by the
condor_starter (if you are not using a shared file system) and system
traffic such as ClassAds to/from the collector. You will have to
configure your MPI library or your shared filesystem (NFS, Gluster,
whatever) to use the IB network separately -- HTCondor's config file
has no impact on those services.
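
To make "separately" concrete: for NFS you would simply mount from the
file server's IPoIB address rather than its ethernet hostname. A rough
/etc/fstab sketch for a compute node (the export path here is made up,
and I am assuming your head node is also the NFS server):

    # mount /home from the head node over its IPoIB address instead of ethernet
    192.168.0.179:/export/home   /home   nfs   defaults   0 0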
Unless you are asking HTCondor to transfer large amounts of data via
the transfer_input_files / transfer_output_files knobs in your job
submit file, I am not sure there is any advantage in setting up HTCondor
to use the IB. And even in that case, file transfer is primarily a
bandwidth issue, not a latency issue.
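
For reference, HTCondor itself only moves job data that you explicitly
list via those knobs, i.e. something along the lines of the submit file
below (file names are made up):

    # vanilla universe job that relies on HTCondor file transfer
    universe                = vanilla
    executable              = analyze
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input_data.tar.gz
    transfer_output_files   = results.tar.gz
    queue

If your jobs instead read and write on a shared filesystem, HTCondor's
network settings are not in the data path at all.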
If you still want to set up the HTCondor daemons to use the IB network,
I think the issue you are facing below is that you do not have any
routing set up between your IB network and your ethernet network. That
means that if you set NETWORK_INTERFACE = 192.168.0.* on your CM, you
likely want to set CONDOR_HOST = 192.168.0.179 everywhere else -- i.e.,
use an explicit IP address for CONDOR_HOST, since a DNS name is perhaps
handing out the address of the ethernet interface.
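
In config-file terms, that is roughly:

    # on the central manager: bind the HTCondor daemons to the IB (192.168.0.x) interface
    NETWORK_INTERFACE = 192.168.0.*

    # on every other machine: point at the CM by its IB address, not by a
    # DNS name that may resolve to the ethernet interface
    CONDOR_HOST = 192.168.0.179

After changing these, restart HTCondor on the affected machines
(condor_restart, or your init scripts) so the daemons re-bind and
re-advertise with the new addresses.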
Hope the above helps
Todd
> Right now I have Condor installed on the head node and one of the
> compute nodes. I have read through the documentation about having
> multiple NICs. On the compute node I have BIND_ALL_INTERFACES set
> to True. On the CM I have set NETWORK_INTERFACE = 192.168.0.179,
> which is the IB address. But I still get the "Failed to connect"
> error. The CM is on our production network and on the 192.168.0.x
> network, and has 3 IP addresses assigned.
>
> When I added NETWORK_INTERFACE = 192.168.0.184 to the compute node and
> changed BIND_ALL_INTERFACES to False (or commented it out), I get the
> error "Can't connect to local master."
>
> When using either the IB or GbE network I get the "Error: communication
> error CEDAR:6001:Failed to connect to <192.168.0.179:9618>".
>
> CM   - DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
> Node - DAEMON_LIST = MASTER, STARTD
>
> This might be a Linux issue, which is another problem in itself...
>
> Thanks
> Jon
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
Center for High Throughput Computing     Department of Computer Sciences
HTCondor Technical Lead                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                    Madison, WI 53706-1685