
Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec



On Feb 21, 2013, at 12:46 PM, Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx> wrote:

> On 2/20/2013 11:05 AM, Nathan Panike wrote:
>> On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
>>> When running a dagman job with approximately 10000  nodes, I'm
>>> seeing occasional random job failures with
>>> 
>>> 02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
>>> 02/19/13 22:16:14 IWD: /my/data/dir
>>> 02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py
>>> /my/input/dir/infile
>>> 02/19/13 22:16:14 Running job as user jferrara
>>> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py):
>>> child failed because PRIV_USER_FINAL process was still root before
>>> exec()
>>> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py,
>>> /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error
>>> 666666')
>>> 02/19/13 22:16:15 Failed to start job, exiting
>>> 
>>> in the Starter log.
>>> 
>>> This is on a setup with one central manager and 6 execute systems,
>>> all running linux.
>>> 
>>> Where and when the jobs fail seem completely random. Often I can get
>>> through all 10000 jobs without a failure.
>>> 
>>> Does anyone have any idea what's going on or have any suggestions on
>>> how to debug this?
>> Possibly you landed on a misconfigured machine?
> No, which is why I'm at a loss. A given execute machine will run a bunch of jobs successfully, and then fail a job.
> 
> Is it possible there is a timeout issue in condor when querying user information? I'm using ldap+sssd for user accounts, and I've noticed that account info is usually returned immediately (when running "groups <username>", for example), but every once in a while it takes a couple of seconds.


I believe this error can only occur if the setuid() call that switches to the user's uid fails. That call happens between the clone() and exec() calls, while the child still shares memory with the parent process, so it's difficult to report details on what went wrong.
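
To illustrate the general pattern (this is not HTCondor's code), here is a minimal Python sketch of dropping privileges between fork and exec and reporting a failure back to the parent. The function name spawn_as_user and the close-on-exec pipe used for reporting are assumptions made for the example. It uses fork() rather than clone() with shared memory, so the child here is free to allocate and write to the pipe, which is exactly what is hard in the shared-memory case. A real privilege drop would also reset supplementary groups first; that step is omitted.

```python
import errno
import os
import struct
import sys


def spawn_as_user(uid, gid, argv):
    """Fork, switch to uid/gid, then exec argv (hypothetical helper).

    Returns (pid, err). err is None if exec succeeded; otherwise it is the
    errno the child hit while dropping privileges or calling exec.
    """
    # os.pipe() returns close-on-exec descriptors in Python 3.4+, so a
    # successful exec closes the write end and the parent reads EOF.
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        try:
            os.setgid(gid)  # group first, while we may still have privilege
            os.setuid(uid)
            # Refuse to exec if we asked for a non-root uid but are still root.
            if uid != 0 and os.geteuid() == 0:
                raise OSError(errno.EPERM, "still root before exec")
            os.execv(argv[0], argv)
        except OSError as e:
            os.write(w, struct.pack("i", e.errno or errno.EPERM))
        finally:
            os._exit(127)  # only reached if something above failed
    os.close(w)
    data = os.read(r, 4)  # blocks until the child execs or reports an error
    os.close(r)
    err = struct.unpack("i", data)[0] if data else None
    return pid, err
```

Calling spawn_as_user(os.getuid(), os.getgid(), [sys.executable, "-c", "pass"]) should return an err of None, while pointing argv at a nonexistent program should return errno.ENOENT. The caller still has to waitpid() on the returned pid in both cases.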

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project