[Condor-users] Condor 6.6.10 on FC4 Configuration Woes
- Date: Tue, 19 Jul 2005 16:05:32 +1000
- From: Tom Kobialka <u3356514@xxxxxxxxxx>
- Subject: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes
Hi,
I'm trying to configure Condor 6.6.10 on Fedora Core 4. I've read
through the user manual (including the troubleshooting section) and the
FAQ, searched the mailing lists, and googled. A description of my
problem follows.
The Central Manager is:
uname -a
Linux adac.anu.edu.au 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
The first time I run condor_status on the central manager I get the
following:
7/19 15:15:34 (Sent 0 ads in response to query)
7/19 15:15:36 DC_AUTHENTICATE: attempt to open invalid session adac:3856:1121748123:13, failing.
I cannot reproduce this error after the first time. On every later run
condor_status prints nothing, and CollectorLog displays:
7/19 15:39:04 Got QUERY_STARTD_ADS
7/19 15:39:04 (Sent 0 ads in response to query)
I have a firewall configured on the central manager. The open incoming
ports are 9614, 9618 and an arbitrary port range (65000-65255), as
described in Kewley J, "Using Condor effectively in the presence of
Personal Firewalls", Oct 7 2004.
My condor_config file looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108
HOSTALLOW_WRITE = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108
condor_config.local:
HAS_FIREWALL = TRUE
STARTD_EXPRS = HAS_FIREWALL
COLLECTOR_NAME =
FILESYSTEM_DOMAIN = anu.edu.au
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.203045580144138
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = anu.edu.au
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local
My Execute clients are also running FC4.
uname -a
Linux node1 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 i686 i386 GNU/Linux
When I do a condor_status I get the following:
[root@node2 ~]# condor_status
CEDAR:6001:Failed to connect to <150.203.48.125:9618>
Error: Couldn't contact the condor_collector on adac.anu.edu.au.
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on adac.anu.edu.au, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
[root@node2 ~]#
StartLog looks as follows:
7/19 16:35:16 ******************************************************
7/19 16:35:16 ** condor_startd (CONDOR_STARTD) STARTING UP
7/19 16:35:16 ** /condor/sbin/condor_startd
7/19 16:35:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
7/19 16:35:16 ** $CondorPlatform: I386-LINUX_RH9 $
7/19 16:35:16 ** PID = 2930
7/19 16:35:16 ******************************************************
7/19 16:35:16 Using config file: /condor/etc/condor_config
7/19 16:35:16 Using local config files: /condor-local/condor_config.local
7/19 16:35:16 DaemonCore: Command Socket at <127.0.0.1:33534>
7/19 16:35:16 WARNING: Condor is running on the loopback address (127.0.0.1)
7/19 16:35:16 of this machine, and is not visible to other hosts!
7/19 16:35:16 This may be due to a misconfigured /etc/hosts file.
7/19 16:35:16 Please make sure your hostname is not listed on the
7/19 16:35:16 same line as localhost in /etc/hosts.
7/19 16:35:16 Error computing physical memory with calc_phys_mem().
MEMORY parameter not defined in config file.
Try setting MEMORY to the number of megabytes of RAM.
7/19 16:35:16 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
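The loopback warning in the StartLog above points at /etc/hosts. A minimal sketch of that check in Python follows. It is not part of Condor. The function name is hypothetical, and it assumes the usual hosts-file layout of an address followed by one or more names.

```python
def hostname_on_loopback_lines(hosts_text, hostname):
    """Return the hosts-file lines that map `hostname` to a loopback address.

    Illustrative helper only: it flags any line whose address is
    127.x.x.x or ::1 and whose name list contains `hostname`.
    """
    offending = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        is_loopback = address.startswith("127.") or address == "::1"
        if is_loopback and hostname in names:
            offending.append(raw)
    return offending
```

On an execute node this could be called as hostname_on_loopback_lines(open("/etc/hosts").read(), "node1"). A non-empty result would be consistent with the startd binding to 127.0.0.1 as logged above.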
And MasterLog repeats:
7/19 16:35:26 Can't send UPDATE_MASTER_AD to collector adac.anu.edu.au <150.203.48.125:9618>: Failed to send UDP update command to collector
7/19 16:35:26 The STARTD (pid 2930) exited with status 4
7/19 16:35:26 restarting /condor/sbin/condor_startd in 2057 seconds
7/19 16:35:26 Can't connect to <150.203.48.125:9618>:0, errno = 22
7/19 16:35:26 Will keep trying for 10 seconds...
7/19 16:35:36 Connect failed for 10 seconds; returning FALSE
7/19 16:35:36 ERROR:
SECMAN:2003:TCP connection to <150.203.48.125:9618> failed
From all client nodes I can ping the central manager and telnet to port
9618 on it.
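The telnet test can be scripted so that it is easy to repeat from every node. The sketch below uses only the Python standard library. The function name is hypothetical. It exercises TCP only, so it says nothing about the UDP update that MasterLog reports as failing.

```python
import socket


def can_connect_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False
```

For the pool described here the call would be can_connect_tcp("adac.anu.edu.au", 9618).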
condor_config on the clients looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = *.anu.edu.au
HOSTALLOW_WRITE = *.anu.edu.au
and condor_config.local:
COLLECTOR_NAME = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.0746330738338692
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxx localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,STARTD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = localdomain localhost
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local
I am running Condor as root. The IP addresses of my clients are
included in HOSTALLOW on the central manager. I am logging all rejected
packets on the firewall and nothing seems to be rejected. This could be
a network issue, but that seems unlikely since I can telnet to and ping
the central manager fine.
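To double-check which addresses the central manager's HOSTALLOW list would cover, a rough sketch in Python follows. It approximates the wildcard matching with fnmatch, which is an assumption and not Condor's own matching code. The helper names are hypothetical. The list value is copied from the condor_config shown earlier.

```python
from fnmatch import fnmatch


def parse_allow_list(value):
    """Split a comma-separated HOSTALLOW-style value into patterns."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def is_allowed(candidate, allow_value):
    """Return True if `candidate` matches any pattern in the list.

    Assumption: '*' is treated as a shell-style wildcard via fnmatch.
    """
    return any(fnmatch(candidate, pattern) for pattern in parse_allow_list(allow_value))


# Value of HOSTALLOW_READ / HOSTALLOW_WRITE on the central manager
CENTRAL_MANAGER_ALLOW = (
    "150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, "
    "192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, "
    "192.168.1.107, 192.168.1.108"
)
```

Under this approximation is_allowed("192.168.1.103", CENTRAL_MANAGER_ALLOW) holds, while the loopback address 127.0.0.1 that the startd reports in StartLog does not match any entry.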
I suspect I have misconfigured something, but I don't have enough
experience with Condor to identify the problem. Any help would be
greatly appreciated.
Cheers
Tom
--
Tom Kobialka
HPC Programmer (Data Grids)
APAC National Facility
Australian National University, Canberra, ACT 2600
Tom.Kobialka@xxxxxxxxxx