[HTCondor-users] Configuring Ubuntu22 cluster, missing config step?



I’m configuring HTCondor on a small cluster of Linux boxes. The machines are on an isolated network. IPs are static, and I’m using /etc/hosts instead of DNS to resolve machine names to IPs.

 

It seems I missed a configuration step in the install/configure procedure. Suggestions appreciated.

 

Central manager is 199.17.158.2

Submit nodes are 199.17.158.[6,20]

Execute nodes are 199.17.158.[11-18]
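
For reference, the /etc/hosts mapping for this layout can be sketched as below. This is only an illustration: it assumes host physlinN has address 199.17.158.N, which matches the prompts shown in this message (physlin2, physlin6, physlin11) but is otherwise an assumption, and gen_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# Print /etc/hosts-style lines for the cluster nodes listed above.
# ASSUMPTION: hostname physlinN corresponds to address 199.17.158.N.
gen_hosts() {
    # .2 is the central manager, .6 and .20 are submit nodes,
    # .11 through .18 are execute nodes.
    for n in 2 6 20 11 12 13 14 15 16 17 18; do
        printf '199.17.158.%s physlin%s\n' "$n" "$n"
    done
}

gen_hosts
```

The same lines would need to be present in /etc/hosts on every node, so that each machine resolves every other machine's name consistently.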

 

I can ssh to and from all of these nodes. I ran the curl installer script on every node, following

               https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html

 

and then started the service with

sudo systemctl enable condor; sudo systemctl start condor

 

It looks like the system is running, per

nathan@physlin6:~$ sudo systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Wed 2023-03-22 19:22:20 CDT; 1h 35min ago

   Main PID: 1002 (condor_master)

     Status: "All daemons are responding"

      Tasks: 4 (limit: 4194303)

     Memory: 17.5M

        CPU: 1.471s

     CGroup: /system.slice/condor.service

             ├─1002 /usr/sbin/condor_master -f

             ├─1356 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

             ├─1357 condor_shared_port

             └─1358 condor_schedd

 

Mar 22 19:22:20 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 22 19:22:21 physlin6 htcondor[1071]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).

 

However, when I run condor_status, it returns immediately with no output.

 

Relatedly, condor_q sometimes returns the following:

nathan@physlin2:~$ condor_q

Error: Can't find address for schedd physlin2

 

Extra Info: You probably saw this error because the condor_schedd is not

running on the machine you are trying to query. If the condor_schedd is not

running, the Condor system will not be able to find an address and port to

connect to and satisfy this request. Please make sure the Condor daemons are

running and try again.

 

Extra Info: If the condor_schedd is running on the machine you are trying to

query and you still see the error, the most likely cause is that you have

setup a personal Condor, you have not defined SCHEDD_NAME in your

condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE

setting. You must define either or both of those settings in your config

file, or you must use the -name option to condor_q. Please see the Condor

manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

 

Or, on an execute node:

 

nathan@physlin11:~$ sudo systemctl status condor

● condor.service - Condor Distributed High-Throughput-Computing

     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)

     Active: active (running) since Wed 2023-03-22 19:24:52 CDT; 1h 48min ago

   Main PID: 1117 (condor_master)

     Status: "Problems: "

      Tasks: 3 (limit: 4194303)

     Memory: 18.3M

        CPU: 8.140s

     CGroup: /system.slice/condor.service

             ├─  1117 /usr/sbin/condor_master -f

             ├─  1412 condor_shared_port

             └─186564 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

 

Mar 22 19:24:52 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Mar 22 21:03:14 physlin11 systemd[1]: condor.service: Current command vanished from the unit file, execution of the command list won't be resumed.

Mar 22 21:03:16 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.

Mar 22 21:03:27 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.

Mar 22 21:03:38 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.

Mar 22 21:03:49 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.
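
The process lists above can be compared against what each node is actually configured to run. A minimal diagnostic sketch, assuming condor_config_val is on the PATH (check_condor_roles is a hypothetical helper name); it only reports settings and changes nothing:

```shell
#!/bin/sh
# Report which central manager this node points at and which daemons
# condor_master is configured to start on it.
check_condor_roles() {
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found on PATH"
        return 1
    fi
    # CONDOR_HOST should name the central manager (199.17.158.2 here).
    echo "CONDOR_HOST = $(condor_config_val CONDOR_HOST)"
    # DAEMON_LIST shows the daemons condor_master will start on this node.
    echo "DAEMON_LIST = $(condor_config_val DAEMON_LIST)"
}
```

Running check_condor_roles on one node of each type (central manager, submit, execute) would show whether the three roles are actually configured differently.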

 

 

- - - -

Nathan Moore

Professor of Physics

Winona State University