I’m configuring HTCondor on a small cluster of Linux boxes. The machines are on an isolated network. IPs are static, and I’m using /etc/hosts instead of DNS to resolve machine names to IPs.
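Since name resolution here depends entirely on /etc/hosts, one sanity check is to confirm that every node resolves every other node's name. The following is only a minimal sketch of such a check, not something from the HTCondor documentation: the helper names `resolve_all` and `report` are made up for illustration, and it assumes Python 3 is available on each node.

```python
import socket


def resolve_all(hostnames):
    """Map each hostname to the IPv4 address the local resolver returns.

    The system resolver is used, so entries in /etc/hosts are honored.
    Names that do not resolve map to None.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = socket.gethostbyname(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def report(results):
    """Format one 'name: address' line per host, flagging failed lookups."""
    lines = []
    for name, addr in results.items():
        lines.append(f"{name}: {addr if addr else 'UNRESOLVED'}")
    return "\n".join(lines)
```

Running `print(report(resolve_all(["physlin2", "physlin6", "physlin11"])))` on each node would show whether those three names (the only hostnames that appear in this post) resolve the same way everywhere. The rest of the node names would need to be added to the list by hand.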
It seems like I missed a configure step in the install/configure procedure. Suggestions appreciated.

Central manager is 199.17.158.2
Submit nodes are 199.17.158.[6,20]
Execute nodes are 199.17.158.[11-18]

I can ssh to and from all of these nodes.

I have run the curl installer script on all nodes,
https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html
and then started the service with

sudo systemctl enable condor; sudo systemctl start condor

It looks like the system is running, per

nathan@physlin6:~$ sudo systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-03-22 19:22:20 CDT; 1h 35min ago
   Main PID: 1002 (condor_master)
     Status: "All daemons are responding"
      Tasks: 4 (limit: 4194303)
     Memory: 17.5M
        CPU: 1.471s
     CGroup: /system.slice/condor.service
             ├─1002 /usr/sbin/condor_master -f
             ├─1356 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130
             ├─1357 condor_shared_port
             └─1358 condor_schedd

Mar 22 19:22:20 physlin6 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 22 19:22:21 physlin6 htcondor[1071]: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).

However, when I run condor_status, the command returns nothing, immediately.
Related, sometimes the condor_q command returns the following:

nathan@physlin2:~$ condor_q
Error: Can't find address for schedd physlin2

Extra Info: You probably saw this error because the condor_schedd is not running on the machine you are trying to query. If the condor_schedd is not running, the Condor system will not be able to find an address and port to connect to and satisfy this request. Please make sure the Condor daemons are running and try again.

Extra Info: If the condor_schedd is running on the machine you are trying to query and you still see the error, the most likely cause is that you have setup a personal Condor, you have not defined SCHEDD_NAME in your condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE setting. You must define either or both of those settings in your config file, or you must use the -name option to condor_q. Please see the Condor manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

Or, on an execute node:

nathan@physlin11:~$ sudo systemctl status condor
● condor.service - Condor Distributed High-Throughput-Computing
     Loaded: loaded (/lib/systemd/system/condor.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-03-22 19:24:52 CDT; 1h 48min ago
   Main PID: 1117 (condor_master)
     Status: "Problems: "
      Tasks: 3 (limit: 4194303)
     Memory: 18.3M
        CPU: 8.140s
     CGroup: /system.slice/condor.service
             ├─  1117 /usr/sbin/condor_master -f
             ├─  1412 condor_shared_port
             └─186564 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 130

Mar 22 19:24:52 physlin11 systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 22 21:03:14 physlin11 systemd[1]: condor.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Mar 22 21:03:16 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.
Mar 22 21:03:27 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.
Mar 22 21:03:38 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.
Mar 22 21:03:49 physlin11 systemd[1]: condor.service: Got notification message from PID 1117, but reception is disabled.

- - - -
Nathan Moore
Professor of Physics
Winona State University