Re: [Condor-users] Shadow processes not ending
- Date: Tue, 12 Dec 2006 11:27:59 +0000 (GMT)
- From: Adam Thorn <alt36@xxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Shadow processes not ending
On Thu, 7 Dec 2006, Todd Tannenbaum wrote:
> Re the below - what version and platform are u on? I will guess v6.8.x
> and Linux, but if I guessed wrong please tell me.
Yup, you guessed right - 6.8.1 on Linux.
> Does the below only happen when you have lots of running jobs, or even
> with just a few, or even with just one?
There are generally a few tens of jobs running on my pool, so I can't say
right now what the behaviour is with just one or two jobs running. I'll
try to investigate that further when a convenient opportunity presents
itself. I've also noticed that the following error sometimes appears in the
ShadowLog at the same time as the FileLock errors, in case it helps:
12/12 00:13:35 (201.18) (22856): ERROR "Can no longer talk to
condor_starter <172.24.89.152:9625>" at line 123 in file NTreceivers.C
> If the above does not help, or you cannot configure that way cuz of
> diskless nodes, you could get rid of shadowlog locking altogether by
> having each shadow write into its own log file instead of sharing one.
> To do this, remove (or comment out) SHADOW_LOCK and then change
> SHADOW_LOG to be something like
> SHADOW_LOG=/somewhere/shadowlog.$(pid)
All log and lock files are on a local disk, with the exception of the job
log files (i.e. the "Log" file named in the submit file), which are on NFS.
Basically, our setup is that Condor itself is installed locally on each
machine whilst all users' files are on NFS (which thus includes things
like the submitted executable, and input/output files for each job). The
behaviour seems to be the same for both standard and vanilla jobs, which
are all we run. Could it be the log files for the individual jobs that are
causing the problem? I've tried your ShadowLog.$(pid) suggestion, but that
didn't seem to change anything.
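For anyone following along, a minimal sketch of what the per-shadow log
change looks like in a local config file. The $(LOG) and $(LOCK) directory
macros and the ShadowLock file name are assumptions based on a typical
default setup, not a copy of my actual config, so adjust the paths to
match your own installation:

    # Stop the shadows sharing a lock file by commenting out the setting
    # (the value shown here is an assumed default)
    # SHADOW_LOCK = $(LOCK)/ShadowLock

    # Give each shadow its own log file, named after its process id
    SHADOW_LOG = $(LOG)/ShadowLog.$(pid)

Newly started shadows should pick this up after a condor_reconfig on the
submit machine. Note that it leaves one log file per shadow in the log
directory, so the directory will need tidying from time to time.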
Thanks for the suggestions.
Adam