[Condor-users] starter process exits
- Date: Wed, 19 Jan 2005 13:55:59 -0600 (CST)
- From: John-Paul Robinson <jpr@xxxxxxx>
- Subject: [Condor-users] starter process exits
Hi,
I've set up a condor pool (6.6.7) on our desktop x86 boxes with a fc3
master and suse9.1 execute nodes (kernel 2.6.4). The systems share an nfs
file system and have a common uid_domain. I'm able to start the condor
processes (already fixed the /proc/meminfo problem). The processes are
started as root but run as the user condor. The execute nodes get their
/home/condor served up by NFS and the dirs are auto-mounted. My global
condor_config is in /opt/condor/etc/condor_config and there is a symlink
from /home/condor/condor_config to this file.
I'm seeing some strange behavior both when I start up condor_master and
when I submit jobs to the pool. In the case of condor_master, if I start
this process without first doing an 'ls /home/condor' it dies with a
complaint about not having CONDOR_CONFIG set, not being able to find
/etc/condor/condor_config, or not being able to find
/local/condor/condor_config. The complaint also mentions not finding
~/condor. When I trace the condor_master with strace, however, it doesn't
look like an open() attempt is ever made on ~/condor_config. Even though
df shows /home/condor as already mounted, condor only succeeds in checking
for and finding this directory after I run 'ls /home/condor'. It seems there is
some reason condor is not even attempting to open
/home/condor/condor_config.
This trouble follows me to the startd process. If I submit a job and
monitor the StartLog, it frequently shows that starter exited with status
1. If I run strace on startd, I find that the child is dying for the same
reason mentioned above. Again, no attempt is even made to check
/home/condor/condor_config. It just tries to open the first two. So there
seems to be a problem with condor not wanting to check for this file.
I can "fix" this problem by either doing the ls or setting CONDOR_CONFIG
explicitly.
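
For reference, the two workarounds can be combined in a small launcher. This is only a rough sketch, not anything Condor ships: the helper name prepare_condor_env is made up for illustration, the config and home paths are the ones from my setup, and the condor_master location in the usage comment is a hypothetical placeholder. It lists the auto-mounted directory first (the same effect as the manual 'ls') and then sets CONDOR_CONFIG explicitly in the environment handed to the daemon.

```python
import os


def prepare_condor_env(config_path, automount_dir, base_env=None):
    """Return an environment dict with CONDOR_CONFIG set explicitly.

    Hypothetical helper: listing automount_dir first mirrors the manual
    'ls /home/condor' step, so the automounter is asked for the directory
    before anything looks for the config file.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Touch the directory, as 'ls' would. Errors are ignored here because
    # the existence check below decides whether we can continue.
    try:
        os.listdir(automount_dir)
    except OSError:
        pass

    # isfile() follows symlinks, so a symlinked condor_config is accepted.
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"no config file at {config_path}")

    env["CONDOR_CONFIG"] = config_path
    return env


# Example use (the condor_master path is an assumed placeholder):
#   env = prepare_condor_env("/opt/condor/etc/condor_config", "/home/condor")
#   os.execve("/opt/condor/sbin/condor_master", ["condor_master"], env)
```

This only papers over the problem. It does not explain why the home directory lookup is skipped in the first place.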
Does anyone have insights into this?