Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Unable to correctly create multi-machine pool

Date: Fri, 26 Apr 2024 13:58:57 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Unable to correctly create multi-machine pool

On 4/24/2024 1:51 PM, Thurimella, Vijay wrote:

Hello,

I am trying to run a Pegasus workflow for an experiment. To run the workflow, I was trying to create a multi-machine condor pool by following the instructions in the documentation. Whenever I run through the commands on the webpage and get to the point where I run condor_status on the submit node, I get the following error.

Error: communication error

SECMAN:2007:Failed to end classad message.

I am very new to HTCondor so any advice to help me get my multi machine pool running would be greatly appreciated.

I am creating this multi-machine pool on CloudLab. Each node is an m510 machine running Ubuntu 22.04.02 LTS. The machines are all connected to the same network, and each node has a hostname of the form node{num}. I made node0 the central manager, node1 the submit node, and node2/node3 the execute nodes. The commands I ran to create the multi-machine pool were:

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager node0

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit node0

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute node0

Thanks,

Vijay

Hello Vijay, and welcome to the HTCondor community!

Some quick thoughts:

Re the commands you ran above, just want to confirm that you replaced "$htcondor_password" with an actual password you want to use to secure your pool, correct? If the above is literally what you entered and no such shell variable was set, I am guessing the shell would expand "$htcondor_password" into an empty string.
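
A minimal sketch of a guard against that failure mode, assuming a POSIX-style shell. The helper name check_password_set is hypothetical and is not part of HTCondor or the get.htcondor.org installer; it only shows how an unset variable ends up as an empty string:

```shell
# Hypothetical guard, not part of HTCondor: succeeds only when the
# value passed as $1 is non-empty.
check_password_set() {
    if [ -z "${1:-}" ]; then
        echo "password is empty or unset" >&2
        return 1
    fi
    echo "password is set"
}

# An unset variable expands to an empty string, so the guard fails here.
unset htcondor_password
if ! check_password_set "${htcondor_password:-}" 2>/dev/null; then
    echo "refusing to run the installer with an empty password"
fi
```

Running the guard before each installer command would make an empty password visible right away instead of surfacing later as an authentication problem.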

Another thought... you mentioned the nodes are all connected to the same network, but perhaps the nodes are running a firewall? By default HTCondor uses network port 9618, so that will need to be open.
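
A rough way to test this from the submit node, sketched under a few assumptions: bash (for its /dev/tcp redirection) and the coreutils timeout command are available, and node0 is the central manager as in the commands above. The helper name port_open is hypothetical:

```shell
# Hypothetical helper: succeeds if a TCP connection to host $1, port $2
# can be opened within 3 seconds. Relies on bash's /dev/tcp redirection
# and the coreutils `timeout` command.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Assumed usage: node0 is the central manager, 9618 the default port.
if port_open node0 9618; then
    echo "node0:9618 is reachable"
else
    echo "node0:9618 is not reachable (firewall, wrong host, or daemon down)"
fi
```

A failure here does not say which of those causes applies; it only narrows the problem down to the network path or the daemon on the central manager.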

Also, do your machines have /etc/hosts or DNS set up such that, for instance, on node1 you can resolve the hostname node0? I.e., while logged into node1, can you do "ping node0"?
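
Besides ping, name resolution can be checked on its own. The following is a sketch that assumes the getent utility is installed (as it is on typical Linux systems) and uses the hostnames node0 through node3 from the original message; the helper name can_resolve is hypothetical:

```shell
# Hypothetical helper: succeeds if the name in $1 resolves through the
# system resolver (/etc/hosts, DNS, ...) as consulted by getent.
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}

# Run this on every node; each name should resolve from each machine.
for host in node0 node1 node2 node3; do
    if can_resolve "$host"; then
        echo "$host resolves"
    else
        echo "$host does NOT resolve"
    fi
done
```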

Finally, this may not matter, but personally I have always used fully-qualified hostnames to specify the central manager. In other words, on the lines above, instead of node0, perhaps "node0.mydomain.org" (whatever your domain is). Besides fully-qualified hostnames, I have also used IP addresses.

Hope the above helps. If you are still having trouble, please drop another note.

regards,
Todd

