-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Tom Kobialka
Sent: Tuesday, 19 July 2005 2:06 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Condor 6.6.10 on FC4 Configuration Woes
Hi,
I'm trying to configure Condor 6.6.10 on Fedora Core 4. I've read
through the user manual (including the troubleshooting section) and the
FAQ, searched the mailing lists and googled. A description of my
problem follows.
The Central Manager is:
uname -a
Linux adac.anu.edu.au 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
The first time I do a condor_status on the central manager I get the
following:
7/19 15:15:34 (Sent 0 ads in response to query)
7/19 15:15:36 DC_AUTHENTICATE: attempt to open invalid session adac:3856:1121748123:13, failing.
I cannot reproduce this error after the first time. Every other time,
condor_status comes up blank and CollectorLog displays:
7/19 15:39:04 Got QUERY_STARTD_ADS
7/19 15:39:04 (Sent 0 ads in response to query)
I have a firewall configured on the central manager. The open incoming ports
are 9614, 9618 and an arbitrary port range (65000-65255), as
described in "Kewley J, Using Condor effectively in the presence of
Personal Firewalls, Oct 7 2004,
My condor_config file looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108
HOSTALLOW_WRITE = 150.203.*.*, adac.anu.edu.au, 192.168.1.101, 192.168.1.102, 192.168.1.103, 192.168.1.104, 192.168.1.105, 192.168.1.106, 192.168.1.107, 192.168.1.108
condor_config.local:
HAS_FIREWALL = TRUE
STARTD_EXPRS = HAS_FIREWALL
COLLECTOR_NAME =
FILESYSTEM_DOMAIN = anu.edu.au
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.203045580144138
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = anu.edu.au
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local
My Execute clients are also running FC4.
uname -a
Linux node1 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 i686 i386 GNU/Linux
When I do a condor_status I get the following:
[root@node2 ~]# condor_status
CEDAR:6001:Failed to connect to <150.203.48.125:9618>
Error: Couldn't contact the condor_collector on adac.anu.edu.au.
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines
and jobs in the Condor pool. The condor_collector might not be running,
it might be refusing to communicate with you, there might be a network
problem, or there may be some other problem. Check with your system
administrator to fix this problem.
If you are the system administrator, check that the condor_collector is
running on adac.anu.edu.au, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
[root@node2 ~]#
StartLog looks as follows:
7/19 16:35:16 ******************************************************
7/19 16:35:16 ** condor_startd (CONDOR_STARTD) STARTING UP
7/19 16:35:16 ** /condor/sbin/condor_startd
7/19 16:35:16 ** $CondorVersion: 6.6.10 Jun 13 2005 $
7/19 16:35:16 ** $CondorPlatform: I386-LINUX_RH9 $
7/19 16:35:16 ** PID = 2930
7/19 16:35:16 ******************************************************
7/19 16:35:16 Using config file: /condor/etc/condor_config
7/19 16:35:16 Using local config files: /condor-local/condor_config.local
7/19 16:35:16 DaemonCore: Command Socket at <127.0.0.1:33534>
7/19 16:35:16 WARNING: Condor is running on the loopback address (127.0.0.1)
7/19 16:35:16 of this machine, and is not visible to other hosts!
7/19 16:35:16 This may be due to a misconfigured /etc/hosts file.
7/19 16:35:16 Please make sure your hostname is not listed on the
7/19 16:35:16 same line as localhost in /etc/hosts.
7/19 16:35:16 Error computing physical memory with calc_phys_mem().
    MEMORY parameter not defined in config file.
    Try setting MEMORY to the number of megabytes of RAM.
7/19 16:35:16 ERROR "Can't compute physical memory." at line 60 in file ResAttributes.C
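The WARNING above points at /etc/hosts. As a minimal sketch of that check, the following Python script reports whether a machine's hostname shares an /etc/hosts line with localhost. The helper names (names_on_localhost_lines, hostname_on_localhost_line) are hypothetical and not part of Condor; the script only reads the file and changes nothing.

```python
import socket


def names_on_localhost_lines(hosts_text):
    """Return every name that shares an /etc/hosts line with 'localhost'."""
    shared = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or address without names
        names = fields[1:]  # fields[0] is the IP address
        if "localhost" in names:
            shared.update(n for n in names if n != "localhost")
    return shared


def hostname_on_localhost_line(hosts_text, hostname):
    """True if the full or short hostname appears on a localhost line."""
    short = hostname.split(".", 1)[0]
    shared = names_on_localhost_lines(hosts_text)
    return hostname in shared or short in shared


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        with open("/etc/hosts") as handle:
            text = handle.read()
    except OSError as err:
        print("could not read /etc/hosts:", err)
    else:
        if hostname_on_localhost_line(text, host):
            print(host, "is listed on the same line as localhost")
        else:
            print(host, "is not listed on a localhost line")
```

If the script reports that the hostname is on a localhost line, that matches the situation the StartLog warning describes.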
And MasterLog repeats:
7/19 16:35:26 Can't send UPDATE_MASTER_AD to collector adac.anu.edu.au <150.203.48.125:9618>: Failed to send UDP update command to collector
7/19 16:35:26 The STARTD (pid 2930) exited with status 4
7/19 16:35:26 restarting /condor/sbin/condor_startd in 2057 seconds
7/19 16:35:26 Can't connect to <150.203.48.125:9618>:0, errno = 22
7/19 16:35:26 Will keep trying for 10 seconds...
7/19 16:35:36 Connect failed for 10 seconds; returning FALSE
7/19 16:35:36 ERROR: SECMAN:2003:TCP connection to <150.203.48.125:9618> failed
I can ping and telnet to 9618 on the central manager from all client
nodes.
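The telnet test above amounts to opening a TCP connection to the collector port. A minimal Python sketch of the same check follows; parse_address and can_connect are hypothetical helper names, not Condor tools, and the address in the example comment is the one from the MasterLog excerpt.

```python
import re
import socket


def parse_address(text):
    """Split an address written as <ip:port> into (ip, port)."""
    match = re.fullmatch(r"<([^:<>]+):(\d+)>", text.strip())
    if match is None:
        raise ValueError("not of the form <ip:port>: %r" % text)
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, to be run from a client node:
#   can_connect(*parse_address("<150.203.48.125:9618>"))
```

A True result only shows that the TCP port is reachable from that node. The UPDATE_MASTER_AD failure in MasterLog concerns a UDP update, which this check does not exercise.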
condor_config on the clients looks as follows:
CONDOR_HOST = adac.anu.edu.au
LOCAL_DIR = $(TILDE)
UID_DOMAIN = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
HOSTALLOW_READ = *.anu.edu.au
HOSTALLOW_WRITE = *.anu.edu.au
and condor_config.local:
COLLECTOR_NAME = adac.anu.edu.au
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
SUSPEND =
LOCK = /tmp/condor-lock.$(HOSTNAME)0.0746330738338692
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxx localhost
START =
MAIL = /bin/mail
RELEASE_DIR = /condor
DAEMON_LIST = MASTER,STARTD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT =
UID_DOMAIN = localdomain localhost
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = adac.anu.edu.au
CONDOR_IDS = 503.503
LOCAL_DIR = /condor-local
I am running Condor as root. The IP addresses of my clients are
included in HOSTALLOW on the central manager. I am logging all rejected
packets on the firewall and nothing seems to be rejected. This could be
a network issue, but that seems unlikely since I can telnet to and ping
the central manager fine.
I suspect I have misconfigured something, but I don't have enough
experience with Condor to identify the problem. Any help would be
greatly appreciated.
Cheers
Tom
--
Tom Kobialka
HPC Programmer (Data Grids)
APAC National Facility
Australian National University, Canberra, ACT 2600
Tom.Kobialka@xxxxxxxxxx
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users