Hi,
I have been trying to get Condor working on a Rocks cluster. Condor works on the front end and on one of the compute nodes, but the rest of the compute nodes have a problem. There are 8 compute nodes in total.

----------------
condor_config.local on the front end looks like this:

#
# Condor local configuration file for frontend node.
#
COLLECTOR_NAME = Collector at protos
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = protos.cs.bgsu.edu
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = cs.bgsu.edu
HOSTALLOW_WRITE = protos.cs.bgsu.edu, *.local
JAVA = /usr/java/jdk1.5.0_07/bin/java
LOCAL_DIR = /home/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 129.1.64.210
RELEASE_DIR = /opt/condor
UID_DOMAIN = cs.bgsu.edu
-----------------------------------
and on the client nodes it looks like this:

CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = protos.cs.bgsu.edu
CONDOR_IDS = 407.407
DAEMON_LIST = MASTER, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = cs.bgsu.edu
HOSTALLOW_WRITE = protos.cs.bgsu.edu, *.local
JAVA = /usr/java/jdk1.5.0_07/bin/java
LOCAL_DIR = /home/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 10.255.255.254
RELEASE_DIR = /opt/condor
UID_DOMAIN = cs.bgsu.edu
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
# Now set the argument with the Sun-specific maximum allowable value
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
-----------------------------------

I was able to get the schedd, startd and master processes running on one of the nodes. But when I try to do the same on the others, it fails, and I get the following messages in the Condor MasterLog:

11/14 11:18:05 Using config source: /opt/condor/etc/condor_config
11/14 11:18:05 Using local config sources:
11/14 11:18:05    /opt/condor/etc/condor_config.local
11/14 11:18:05 Failed to bind to command ReliSock
11/14 11:18:05 (Make sure your IP address is correct in /etc/hosts.)
11/14 11:18:05 ERROR "BindAnyCommandPort() failed" at line 8386 in file daemon_core.C

The IP address is correct in the /etc/hosts file.

Also, when I run condor_q on these nodes I get:
-------------------
Failed to fetch ads from: <10.255.255.254:45932> : compute-0-1.local
CEDAR:6001:Failed to connect to
---------------------
and condor_status gives:
----
[condor@compute-0-1 log]$ condor_status
CEDAR:6001:Failed to connect to <129.1.64.210:9618>
Error: Couldn't contact the condor_collector on protos.cs.bgsu.edu.
---

On the head node, ps -aux | grep condor shows:

condor 2978 0.0 0.2 7536 2380 ? Ss Nov10 2:37 /opt/condor/sbin/condor_master
condor 2995 0.0 0.2 7508 3036 ? Ss Nov10 0:13 condor_collector -f
condor 3096 0.0 0.3 8964 4036 ? Ss Nov10 0:01 condor_schedd -f
condor 3097 0.0 0.3 7412 3080 ? Ss Nov10 0:12 condor_negotiator -f

I can see that the collector is running on the head node. I just cannot figure out what I am missing, or where. Please help.
-----------------------------------
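Since the MasterLog error is a failed bind, one thing that could be checked on each compute node is whether the address given in NETWORK_INTERFACE can actually be bound on that node. Below is a minimal Python sketch of such a check. It is not part of Condor; the function names are made up for illustration, it assumes a simple "NAME = value" config layout with case-insensitive names, and it only tests a plain TCP bind, which may not match exactly what condor_master does.

```python
import socket


def parse_network_interface(config_text):
    """Return the last NETWORK_INTERFACE value found in the config text, or None.

    Assumes simple "NAME = value" lines; macros such as $(HOSTNAME) are not expanded.
    """
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip().upper() == "NETWORK_INTERFACE":
            value = val.strip()  # a later setting overrides an earlier one
    return value


def can_bind(ip, port=0):
    """Return True if a TCP socket can be bound to ip on this host.

    Port 0 lets the operating system pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
        return True
    except OSError:
        return False
    finally:
        sock.close()


def check_config(path):
    """Read a config file and report whether its NETWORK_INTERFACE address binds here."""
    with open(path) as f:
        ip = parse_network_interface(f.read())
    if ip is None:
        return "NETWORK_INTERFACE is not set in %s" % path
    status = "can" if can_bind(ip) else "CANNOT"
    return "%s %s be bound on %s" % (ip, status, socket.gethostname())
```

Running print(check_config("/opt/condor/etc/condor_config.local")) on each compute node would show whether the configured address belongs to that node. A "CANNOT" result would suggest the address is assigned to a different machine.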
Samir Khanal
CS Grad Student
Hayes 226
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx