Hello,
I've installed Condor on 3 Ubuntu-based machines using the latest Condor release from the Debian repository, by running apt-get install condor after updating my local sources. Everything installed without problems. That install uses the personal Condor configuration, which must be customized.
One machine is the host (central manager) and the other 2 machines are submit/execute. I customized the local config file; the version for the submit machine is included below. I used ufw to open the ports needed.
I don't think the connection to the negotiator is working correctly. When I run condor_q, it reports the queue quickly. When I run condor_q -analyze, it takes a very long time and then reports a communication error with the negotiator daemon. I've also included sample output from what I think are the relevant log files. I've searched on this issue using Google etc.; others have reported something similar, but I wasn't able to find a clear resolution. Does anyone have any insights on this?
setti@MDR-U1-VM:~$ condor_q -analyze
Error: Could not connect to negotiator (MDR-BELL4164-U1)
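
A minimal sketch of how the TCP connection itself could be tested outside of Condor, assuming Python 3 is available on the submit machine. The helper name can_connect is hypothetical; the address and port in the usage comment are the negotiator address that appears in the log output further down. A timeout here, with ufw in use, may point at a blocked port rather than at Condor itself.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Example usage (address and port copied from the log output):
#   can_connect("130.184.159.107", 9646, timeout=20.0)
```

Running the same check from the host against the submit machine's ports would show whether the problem is one-directional.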
-----------------------------------------
## What machine is your central manager?
#CONDOR_HOST = $(FULL_HOSTNAME)
CONDOR_HOST = 130.184.159.107
ALLOW_ADMINISTRATOR = $(CONDOR_HOST),$(IP_ADDRESS),$(FULL_HOSTNAME)
## Pool's short description
COLLECTOR_NAME = UA-INEG-CONDOR at $(CONDOR_HOST)
CONDOR_ADMIN = rossetti@xxxxxxxx
ALLOW_READ = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6
ALLOW_WRITE = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6
HIGHPORT = 9700
LOWPORT = 9600
NETWORK_INTERFACE = 130.184.158.107
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, SCHEDD, STARTD
------------------------------------------------------
08/17/11 20:43:42 Communicating with shadow<130.184.158.107:9609?noUDP>
08/17/11 20:43:42 Submitting machine is "uaf38185.ddns.uark.edu"
08/17/11 20:43:42 ERROR: the submitting host claims to be in our UidDomain (MDR-U1-VM), yet its hostname (uaf38185.ddns.uark.edu) does not match. If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name. To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.
08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
08/17/11 20:23:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:23:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:24:03 (pid:1988) attempt to connect to<130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:24:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to<130.184.159.107:9646> with TCP.
|SECMAN:2003:TCP connection to<130.184.159.107:9646> failed.
08/17/11 20:43:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Haven't heard from negotiator, trying to claim local startd @<130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:42 (pid:1988) Tables are consistent
08/17/11 20:43:42 (pid:1988) Rebuilt prioritized runnable job list in 0.000s.
08/17/11 20:43:42 (pid:1988) Claiming local startd slot 1 at<130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Negotiator gone, trying to use our local startd
08/17/11 20:43:42 (pid:1988) Completed REQUEST_CLAIM to startd MDR-U1-VM<130.184.158.107:9675> for rossetti
08/17/11 20:43:42 (pid:1988) Starting add_shadow_birthdate(1.0)
08/17/11 20:43:42 (pid:1988) Started shadow for job 1.0 on MDR-U1-VM<130.184.158.107:9675> for rossetti, (shadow pid = 6690)
08/17/11 20:43:43 (pid:1988) Shadow pid 6690 for job 1.0 reports job exit reason 100.
08/17/11 20:43:43 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:43 (pid:1988) Tables are consistent
08/17/11 20:43:43 (pid:1988) Rebuilt prioritized runnable job list in 0.000s. (Expedited rebuild because no match was found)
08/17/11 20:43:43 (pid:1988) Completed RELEASE_CLAIM to startd at<130.184.158.107:9675>
08/17/11 20:43:43 (pid:1988) Match record (MDR-U1-VM<130.184.158.107:9675> for rossetti, 1.0) deleted
08/17/11 20:44:03 (pid:1988) attempt to connect to<130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:44:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to<130.184.159.107:9646> with TCP.
-----------------------------------------------------
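
The UidDomain error above comes down to a hostname mismatch: the submit machine's IP reverse-resolves to a name different from MDR-U1-VM. A rough sketch of how the forward and reverse lookups could be inspected, again assuming Python 3. The function names are hypothetical, and this only uses the system resolver; it is not how Condor performs the check internally.

```python
import socket


def reverse_lookup(ip):
    """Return the primary hostname the resolver reports for `ip`, or None if the lookup fails."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None


def forward_lookup(name):
    """Return the IPv4 address the resolver reports for `name`, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


# Example usage (values copied from the config and log output):
#   reverse_lookup("130.184.158.107")   # name seen for the submit machine's IP
#   forward_lookup("MDR-U1-VM")         # whether the short hostname resolves at all
```

If the reverse lookup returns a name other than the one Condor is configured with, that would be consistent with the error message in the log.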
Manuel D. Rossetti, Ph.D., P.E.
Professor and Associate Department Head
University of Arkansas
Department of Industrial Engineering
4207 Bell Engineering Center
Fayetteville, AR 72701
Phone: (479) 575-6756
Fax: (479) 575-8431
email: rossetti@xxxxxxxx
www: www.uark.edu/~rossetti