Re: [Condor-users] condor_schedd running under the wrong user
- Date: Thu, 1 Feb 2007 23:04:28 -0600
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [Condor-users] condor_schedd running under the wrong user
On Jan 31, 2007, at 8:08 AM, Jamie Rollins wrote:
Hi, folks. I have a question that I hope someone may have some insight
into.
The other day, for a reason that is unknown to me, the condor_schedd
daemon running on the central manager stopped running as the user
"condor", which I believe it had been previously (and which all the
other daemons are running as), and started running instead as a
different user (user "x", say, who uses the pool frequently). This
caused the condor_schedd daemon to freeze, presumably because it
couldn't write to any of its log/execute/spool files, which are only
writable as the "condor" user. Although this issue seems to be
coincident with an upgrade of the domain/LDAP controller and NFS home
directory server, I'm not convinced that they're related (everything
else seems to be working ok).
I was able to make the problem go away for a bit by killing all the
daemons and flushing the spool directory for the central manager, then
restarting the daemons. After that the schedd daemon started up as
"condor". However, after a while, and some more use by user "x", the
schedd daemon mysteriously started running as user "x" again. I can't
find anything in the logs that would indicate how, when, or why this
change may have happened.
Parenthetically, I'm having trouble figuring out how the daemons are
determining what user to run as to begin with. The condor_master is
started as root, but then immediately starts running as user "condor".
All the sub-daemons run as "condor" (collector, negotiator, startd),
except the schedd, which mysteriously runs as user "x" (unless the
spool directory has been cleared). Nowhere in the configuration files
do I specify that the daemons run as user "condor".
I found a mail to this list from last May where a user ('rok')
describes what appears to be a very similar problem (see attached
message below). Unfortunately there weren't any replies. Has anyone
else ever experienced anything like this? Rok, did you ever get the
issue resolved, or figure out what was causing it? Any thoughts at all
would be very much appreciated.
The schedd starts life as root, then switches its effective uid to
'condor' for most of its life. It switches to users' uids temporarily
to perform actions as those users (accessing job files, starting
scheduler universe jobs, etc.). What's probably happening is that the
schedd is freezing in the middle of one of these operations. Problems
talking to the NFS server could easily cause this.
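To make the switching pattern concrete, here is a minimal sketch in Python of a process that temporarily takes on another effective uid and then switches back. It is not Condor's actual code: the name temporary_euid and the injectable get_euid/set_euid parameters are made up for illustration, group ids are ignored, and it assumes the process was started as root, so it can regain root before moving between two non-root uids.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_euid(uid, get_euid=os.geteuid, set_euid=os.seteuid):
    """Run the enclosed block with effective uid `uid`, then switch back.

    Hypothetical helper for illustration only. Assumes the process was
    started as root, so returning to euid 0 is permitted.
    """
    saved = get_euid()
    if saved != 0:
        # A non-root euid cannot jump straight to another user's uid,
        # so regain root first.
        set_euid(0)
    set_euid(uid)
    try:
        yield
    finally:
        # Restore the previous identity even if the block raised.
        set_euid(0)
        if saved != 0:
            set_euid(saved)
```

If the code inside the with block never returns, for example because a file operation is stuck waiting on an unresponsive file server, the restore step never runs and the process keeps showing up under the user's uid.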
Could you set the following in your Condor config file and then send
us the end of the schedd log the next time this happens:
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND
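For grabbing the end of the log once the problem recurs, a small Python helper along these lines would do. This is only a sketch: the function name tail_lines and the default of 200 lines are arbitrary choices, and the log's location depends on your installation (running condor_config_val SCHEDD_LOG should, I believe, print it).

```python
from collections import deque


def tail_lines(path, count=200):
    """Return the last `count` lines of the text file at `path`."""
    # errors="replace" keeps the read going if the log contains
    # bytes that are not valid in the default encoding.
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return list(deque(handle, maxlen=count))
```

Something like print("".join(tail_lines(schedd_log_path))) then prints the tail, where schedd_log_path stands for wherever your schedd log actually lives.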
+--------------------------------+-----------------------------------+
| Jaime Frey | I used to be a heavy gambler. |
| jfrey@xxxxxxxxxxx | But now I just make mental bets. |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. |
+--------------------------------+-----------------------------------+