Hello, all,
I am new to Condor and am having real problems managing our cluster. It is a small cluster with one master node (which acts as the Condor server) and 16 compute nodes. We recently disassembled the cluster and moved it to another location. After we plugged everything back in and turned on all the machines, we found that Condor was not working. Since the IP address of the master node has changed, I assumed something needed to change in the Condor configuration as well. So I opened the "condor_config.local" file on the master node and updated the "NETWORK_INTERFACE" entry. After that I was able to start Condor:
# ps -ef | grep condor
condor 3639 1 0 Sep03 ? 00:00:11 /opt/condor/sbin/condor_master
condor 3651 3639 0 Sep03 ? 00:00:00 condor_collector -f
condor 3652 3639 0 Sep03 ? 00:00:01 condor_schedd -f
condor 3653 3639 0 Sep03 ? 00:00:00 condor_negotiator -f
root 15130 15111 0 12:55 pts/1 00:00:00 grep condor
However, when I run "condor_q", it sometimes returns the queue, but most of the time it returns:
-- Failed to fetch ads from: <ip address> : hostname
It seems to be very unstable. I have rebooted the master node once and it did not help. Also, the jobs in the queue are still idle; they have not been sent to the compute nodes (the system has been up for almost a day now, and I am able to ssh to those nodes). I am not sure whether there is something else I need to change after the move, or whether something went wrong. Any help would be appreciated. Thanks,
Li Xi
Department of Chemical and Biological Engineering
University of Wisconsin-Madison