
Re: [HTCondor-users] Unable to correctly create multi-machine pool



On 4/24/2024 1:51 PM, Thurimella, Vijay wrote:
Hello,

I am trying to run a Pegasus workflow for an experiment. To run the workflow, I was trying to create a multi-machine HTCondor pool using the instructions in the documentation from here. When I run through the commands on the webpage and get to the point where I run condor_status on the submit node, I get the following error:

Error: communication error
SECMAN:2007:Failed to end classad message.  

I am very new to HTCondor, so any advice to help me get my multi-machine pool running would be greatly appreciated.

I am creating this multi-machine pool using CloudLab. Each node is an m510 machine running Ubuntu 22.04.02 LTS. The machines are all connected to the same network, and each node has a hostname of the form node{num}. I made node0 the central manager, node1 the submit node, and node2/node3 the execute nodes. The commands I ran to create the multi-machine pool were:


$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager node0

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit node0

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute node0

Thanks,
Vijay

Hello Vijay, and welcome to the HTCondor community!

Some quick thoughts:

Re the commands you ran above, just want to confirm that you replaced "$htcondor_password" with an actual password you want to use to secure your pool, correct?  If the above is literally what you entered and that shell variable was not set beforehand, the shell would expand it to an empty string, so the installer would receive an empty password.
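
As a quick sanity check before re-running the installer, something like the following shows whether the variable would expand to an empty string. This is only a minimal sketch: check_password_var is an illustrative helper name (not part of HTCondor), and the password value is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: report whether the given value is empty,
# i.e. what the installer would see in GET_HTCONDOR_PASSWORD.
check_password_var() {
    if [ -z "$1" ]; then
        echo "empty"
    else
        echo "set"
    fi
}

# Case 1: the variable was never set, so it expands to an empty string.
unset htcondor_password
check_password_var "${htcondor_password:-}"

# Case 2: the variable is set first (placeholder value, pick your own).
htcondor_password="example-only-password"
check_password_var "${htcondor_password:-}"
```

The first call prints "empty" and the second prints "set". The same value must be used on every node in the pool.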

Another thought... you mentioned the nodes are all connected to the same network, but perhaps the nodes are running a firewall?  By default HTCondor uses network port 9618, so that port will need to be open between the nodes.
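
One way to test this from the submit node is sketched below. It assumes bash and the coreutils timeout command are installed; port_open is an illustrative helper name, and node0 is the central manager from your message.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to host $1, port $2
# can be opened within 3 seconds. Relies on bash's /dev/tcp feature.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# 9618 is HTCondor's default port; node0 is the central manager here.
if port_open node0 9618; then
    echo "node0:9618 is reachable"
else
    echo "node0:9618 is NOT reachable (firewall, name lookup, or daemon not running)"
fi

# If ufw turns out to be the active firewall on Ubuntu, the port could be
# opened with:
#   sudo ufw allow 9618/tcp
```

A failure here does not say which of the causes applies, only that a connection could not be made, so it is worth checking name resolution as well.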

Also, do your machines have /etc/hosts or DNS set up such that, for instance, on node1 you can resolve the hostname node0?  I.e. while logged into node1, can you do "ping node0"?
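
If ping is blocked, name resolution can also be checked on its own with getent, which consults /etc/hosts and DNS as configured on the machine. A minimal sketch, where resolves is an illustrative helper name and the host list is taken from your message:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the hostname resolves via /etc/hosts or DNS.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Check every node in the pool; run this on each machine.
for host in node0 node1 node2 node3; do
    if resolves "$host"; then
        echo "$host resolves to: $(getent hosts "$host" | awk '{print $1; exit}')"
    else
        echo "$host does NOT resolve; add it to /etc/hosts or DNS"
    fi
done
```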

Finally, this may not matter, but personally I have always used fully-qualified hostnames to specify the central manager.  In other words, on the lines above, instead of node0, perhaps "node0.mydomain.org" (whatever your domain is).  Besides fully-qualified hostnames, I have also used IP addresses.

Hope the above helps.  If you are still having trouble, please drop another note.

regards,
Todd