On Tue, Feb 19, 2013 at 11:07:13PM -0500, Jason Ferrara wrote:
When running a dagman job with approximately 10000 nodes, I'm
seeing occasional random job failures with
02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
02/19/13 22:16:14 IWD: /my/data/dir
02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
02/19/13 22:16:14 Running job as user jferrara
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
02/19/13 22:16:15 Failed to start job, exiting
in the Starter log.
This is on a setup with one central manager and 6 execute systems,
all running Linux.
Where and when the jobs fail seems completely random. Often I can get
through all 10000 jobs without a failure.
Does anyone have any idea what's going on, or have any suggestions on
how to debug this?
Possibly you landed on a misconfigured machine?
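
One rough way to check that guess is to count how often the failure shows up in each execute machine's Starter log. If the failures cluster on one host, that points at that machine's configuration. If they are spread evenly, it probably does not. The sketch below is only an illustration: the function names are made up, it assumes you have already read each machine's Starter log into a string, and it assumes each failure line follows the "Starting a ... universe job" line for the same job, as in the excerpt above.

```python
import re

# Text taken from the failure line in the Starter log excerpt above.
FAILURE_MARKER = "PRIV_USER_FINAL process was still root"
# Matches e.g. "Starting a VANILLA universe job with ID: 240791.0"
JOB_START = re.compile(r"Starting a \w+ universe job with ID: (\S+)")


def failed_job_ids(log_text):
    """Return the IDs of jobs whose start-up hit the PRIV_USER_FINAL failure."""
    failed = []
    current_job = None
    for line in log_text.splitlines():
        match = JOB_START.search(line)
        if match:
            # Remember which job the following log lines belong to.
            current_job = match.group(1)
        elif FAILURE_MARKER in line and current_job is not None:
            failed.append(current_job)
            current_job = None  # count each job at most once
    return failed


def failures_per_host(logs_by_host):
    """Map each host name to the number of failed job starts in its log text."""
    return {host: len(failed_job_ids(text)) for host, text in logs_by_host.items()}
```

You would call failures_per_host() with a dict such as {"exec1": text_of_its_starter_log, ...}. The host names and how you collect the logs are up to you.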