
Re: [Condor-users] issues with negotiator and configuration



These instructions are for Fedora rather than Ubuntu, but once you have the bits installed, the rest should hopefully be the same -

http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/

http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/

And if the Ubuntu package doesn't use a config directory -

http://spinningmatt.wordpress.com/2011/06/16/customizing-condor-configuration-local_config_fil-vs-local_config_dir/

Best,


matt

On 08/18/2011 07:14 PM, Manuel Rossetti wrote:
Hello,

I've installed Condor on 3 Ubuntu-based machines using the latest release of Condor from the Debian repository, by running apt-get install condor after updating my local sources.  Everything installed without problems.  That install uses the personal Condor configuration, which must be customized.

I have 1 machine that is the host (central manager) and 2 machines that are submit/execute.  I customized the local config file; the version for the submit machine is included below.  I used ufw to open the ports needed.

I don't think the connection to the negotiator is working correctly.  When I execute condor_q, it reports the queue quickly.  When I execute condor_q -analyze, it takes a very long time and then reports a communication error with the negotiator daemon.  I've also included sample output from what I think are the relevant log files.  I've searched for this issue on Google and elsewhere; others have reported something similar, but I wasn't able to find a clear resolution.  I'm wondering if anyone has some insights on this.

setti@MDR-U1-VM:~$ condor_q -analyze
Error: Could not connect to negotiator (MDR-BELL4164-U1)
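
One way to narrow this down is to check whether a plain TCP connection to the negotiator's address can be opened at all from the submit machine. The sketch below is only an illustration and is not part of Condor: can_connect is a hypothetical helper, and the address 130.184.159.107:9646 and the 20-second timeout are taken from the schedd log further down.

```python
import socket


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (address and port as reported in the schedd log):
#   can_connect("130.184.159.107", 9646)
```

If this returns False from the submit machine but True when run on the central manager itself, a firewall rule or routing problem between the two hosts is a likely suspect, though that is not the only possible cause.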

-----------------------------------------

##  What machine is your central manager?

#CONDOR_HOST = $(FULL_HOSTNAME)
CONDOR_HOST = 130.184.159.107

ALLOW_ADMINISTRATOR = $(CONDOR_HOST),$(IP_ADDRESS),$(FULL_HOSTNAME)

## Pool's short description

COLLECTOR_NAME = UA-INEG-CONDOR at $(CONDOR_HOST)

CONDOR_ADMIN = rossetti@xxxxxxxx

ALLOW_READ = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6
ALLOW_WRITE = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6

HIGHPORT = 9700
LOWPORT = 9600

NETWORK_INTERFACE = 130.184.158.107

##  When is this machine willing to start a job?

START = TRUE


##  When to suspend a job?

SUSPEND = FALSE


##  When to nicely stop a job?
##  (as opposed to killing it instantaneously)

PREEMPT = FALSE


##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names

DAEMON_LIST = MASTER, SCHEDD, STARTD

------------------------------------------------------
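
Since the config above restricts Condor to the range LOWPORT..HIGHPORT, it can help to confirm that the ports seen in the logs fall inside that range and are therefore covered by the firewall rules. The following is a minimal sketch under stated assumptions: parse_config and port_in_range are hypothetical helpers, and the parser only handles simple KEY = value lines. It does not expand $(MACRO) references the way Condor does.

```python
def parse_config(text):
    """Parse simple KEY = value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def port_in_range(settings, port):
    """Return True if port lies within LOWPORT..HIGHPORT (inclusive)."""
    low = int(settings["LOWPORT"])
    high = int(settings["HIGHPORT"])
    return low <= port <= high
```

With the values shown above (LOWPORT = 9600, HIGHPORT = 9700), the negotiator port 9646 from the log is inside the range, so the range itself does not look like the problem on this machine.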


08/17/11 20:43:42 Communicating with shadow <130.184.158.107:9609?noUDP>
08/17/11 20:43:42 Submitting machine is "uaf38185.ddns.uark.edu"


08/17/11 20:43:42 ERROR: the submitting host claims to be in our UidDomain (MDR-U1-VM), yet its hostname (uaf38185.ddns.uark.edu) does not match.  If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name.  To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.

08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
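
The ERROR above says the UID domain (MDR-U1-VM) does not match the submitting host's name (uaf38185.ddns.uark.edu). As a rough illustration, and assuming the check amounts to a case-insensitive substring test (an assumption about Condor's behavior, not something taken from its source), the comparison can be sketched as follows. uid_domain_matches is a hypothetical helper.

```python
def uid_domain_matches(uid_domain, hostname):
    """Return True if uid_domain occurs in hostname, ignoring case.

    Assumption: this mirrors the check only approximately; the real
    logic lives inside Condor and may differ in detail.
    """
    return uid_domain.lower() in hostname.lower()


# Values from the error message above:
#   uid_domain_matches("MDR-U1-VM", "uaf38185.ddns.uark.edu")
```

Under this sketch the pair from the log does not match, while a domain that is actually part of the hostname, such as uark.edu, would. The error message itself names the two remedies: fix reverse DNS, or enable TRUST_UID_DOMAIN.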

08/17/11 20:23:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:23:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:24:03 (pid:1988) attempt to connect to <130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:24:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to <130.184.159.107:9646> with TCP.
|SECMAN:2003:TCP connection to <130.184.159.107:9646> failed.

08/17/11 20:43:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Haven't heard from negotiator, trying to claim local startd @<130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:42 (pid:1988) Tables are consistent
08/17/11 20:43:42 (pid:1988) Rebuilt prioritized runnable job list in 0.000s.
08/17/11 20:43:42 (pid:1988) Claiming local startd slot 1 at <130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Negotiator gone, trying to use our local startd
08/17/11 20:43:42 (pid:1988) Completed REQUEST_CLAIM to startd MDR-U1-VM <130.184.158.107:9675> for rossetti
08/17/11 20:43:42 (pid:1988) Starting add_shadow_birthdate(1.0)
08/17/11 20:43:42 (pid:1988) Started shadow for job 1.0 on MDR-U1-VM <130.184.158.107:9675> for rossetti, (shadow pid = 6690)
08/17/11 20:43:43 (pid:1988) Shadow pid 6690 for job 1.0 reports job exit reason 100.
08/17/11 20:43:43 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:43 (pid:1988) Tables are consistent
08/17/11 20:43:43 (pid:1988) Rebuilt prioritized runnable job list in 0.000s.  (Expedited rebuild because no match was found)
08/17/11 20:43:43 (pid:1988) Completed RELEASE_CLAIM to startd at <130.184.158.107:9675>
08/17/11 20:43:43 (pid:1988) Match record (MDR-U1-VM <130.184.158.107:9675> for rossetti, 1.0) deleted
08/17/11 20:44:03 (pid:1988) attempt to connect to <130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:44:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to <130.184.159.107:9646> with TCP.
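
To see at a glance which addresses the schedd keeps failing to reach, the "attempt to connect ... failed" lines in a log like the one above can be tallied. This is only a sketch: failed_targets is a hypothetical helper, and the regular expression is written against the line format shown in this excerpt, so other Condor versions may need a different pattern.

```python
import re
from collections import Counter

# Matches e.g. "attempt to connect to <130.184.159.107:9646> failed";
# the space before "<" is optional, since some archives strip it.
FAILED_CONNECT = re.compile(
    r"attempt to connect to\s*<(?P<host>[\d.]+):(?P<port>\d+)>\s+failed"
)


def failed_targets(log_text):
    """Count failed connection attempts per (host, port) in log_text."""
    counts = Counter()
    for match in FAILED_CONNECT.finditer(log_text):
        counts[(match.group("host"), int(match.group("port")))] += 1
    return counts
```

Run over the excerpt above, every failure points at the same address, 130.184.159.107:9646, which is the negotiator on the central manager, while the claims against the local startd at 130.184.158.107:9675 succeed.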

-----------------------------------------------------
Manuel D. Rossetti, Ph.D., P.E.
Professor and Associate Department Head
University of Arkansas
Department of Industrial Engineering
4207 Bell Engineering Center
Fayetteville, AR 72701
Phone: (479) 575-6756
Fax: (479) 575-8431
email: rossetti@xxxxxxxx
www: www.uark.edu/~rossetti