
[HTCondor-users] InfiniBand



I have been tasked with setting up a new server cluster. There is one head node and 12 compute nodes. This system is connected via InfiniBand. I read in the documentation that IB is useful for parallel jobs. Can I utilize this network with the Vanilla Universe?


Right now I have Condor installed on the head node and one of the compute nodes. I have read through the documentation about machines with multiple NICs. On the compute node I have BIND_ALL_INTERFACES set to True. On the CM I have set NETWORK_INTERFACE = 192.168.0.179, which is the IB address. But I still get the "Failed to connect" error. The CM is on our production network and on the 192.168.0.x network, and has 3 IP addresses assigned.


When I added NETWORK_INTERFACE = 192.168.0.184 to the compute node and set BIND_ALL_INTERFACES = False (or commented it out), I get the error "Can't connect to local master".
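
For reference, here is a rough sketch of the two setups described above. It lists only the settings mentioned in this message; everything else in the configuration is left out.

```
# Head node (CM)
NETWORK_INTERFACE = 192.168.0.179

# Compute node, first attempt
BIND_ALL_INTERFACES = True

# Compute node, second attempt
NETWORK_INTERFACE = 192.168.0.184
BIND_ALL_INTERFACES = False
```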


When using either the IB or GbE network I get the "Error: communication error

CEDAR:6001:Failed to connect to <192.168.0.179:9618>".


CM - DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

Node - DAEMON_LIST = MASTER, STARTD


This might be a Linux issue, which is another problem in itself...


Thanks


Jon
