Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec
- Date: Wed, 20 Feb 2013 10:05:42 -0600
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec
On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
> When running a dagman job with approximately 10000 nodes, I'm
> seeing occasional random job failures with
>
> 02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
> 02/19/13 22:16:14 IWD: /my/data/dir
> 02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
> 02/19/13 22:16:14 Running job as user jferrara
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
> 02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
> 02/19/13 22:16:15 Failed to start job, exiting
>
> in the Starter log.
>
> This is on a setup with one central manager and 6 execute systems,
> all running Linux.
>
> Where and when the jobs fail seems completely random. Often I can get
> through all 10000 jobs without a failure.
>
> Does anyone have any idea what's going on or have any suggestions on
> how to debug this?
Possibly some of the jobs landed on a misconfigured machine?
With DAGMan, you could add a "RETRY" line for each node, so that DAGMan
reruns a failed node instead of immediately marking it as failed. This is
valuable when the failures really are random or intermittent.
http://research.cs.wisc.edu/htcondor/manual/v7.9/2_10DAGMan_Applications.html#dagman:retry
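With around 10000 nodes, adding the RETRY lines by hand is impractical, so
here is a rough sketch of a helper that appends one "RETRY <node> <count>"
line for every JOB node that does not already have one. The function name
add_retries and the default of 3 retries are my own choices for
illustration, not anything from HTCondor, and the sketch assumes a simple
DAG file where each node is declared on a single line starting with JOB
(no splices or other constructs).

```python
def add_retries(dag_text: str, retries: int = 3) -> str:
    """Return dag_text with a RETRY line appended for each JOB node lacking one.

    Hypothetical helper: assumes one "JOB <name> <submit file>" per line.
    """
    jobs = []
    has_retry = set()
    for line in dag_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        keyword = fields[0].upper()
        if keyword == "JOB":
            jobs.append(fields[1])
        elif keyword == "RETRY":
            has_retry.add(fields[1])

    # Only add RETRY for nodes that do not already specify one.
    extra = [f"RETRY {name} {retries}" for name in jobs if name not in has_retry]
    if not extra:
        return dag_text
    return dag_text.rstrip("\n") + "\n" + "\n".join(extra) + "\n"
```

You would read the DAG file, pass its contents through add_retries, write
the result back out, and submit with condor_submit_dag as usual. Nodes
that already carry their own RETRY line are left alone.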
Nathan Panike