[HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec
- Date: Tue, 19 Feb 2013 23:07:13 -0500
- From: Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] child failed because PRIV_USER_FINAL process was still root before exec
When running a DAGMan job with approximately 10000 nodes, I'm seeing
occasional, seemingly random job failures with
02/19/13 22:16:14 Starting a VANILLA universe job with ID: 240791.0
02/19/13 22:16:14 IWD: /my/data/dir
02/19/13 22:16:14 About to exec /home/jferrara/bin/myprog.py /my/input/dir/infile
02/19/13 22:16:14 Running job as user jferrara
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py): child failed because PRIV_USER_FINAL process was still root before exec()
02/19/13 22:16:15 Create_Process(/home/jferrara/bin/myprog.py, /my/input/dir/infile, ...) failed: (errno=666666: 'Unknown error 666666')
02/19/13 22:16:15 Failed to start job, exiting
in the Starter log.
This is on a setup with one central manager and 6 execute systems, all
running Linux.
Where and when the jobs fail seems completely random. Often I can get
through all 10000 jobs without a failure.
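To check whether the failures really are random, or whether they cluster
by hour or by execute system, the Starter logs could be tallied with
something like the following. This is only a rough sketch: it assumes the
timestamp format and the error text shown in the excerpt above, and the
function name tally_failures and the command-line usage are made up for
illustration, not part of HTCondor.

```python
import re
import sys
from collections import Counter

# Error text copied from the Starter log excerpt above.
FAILURE = "PRIV_USER_FINAL process was still root before exec"
# Timestamp as it appears in the excerpt: MM/DD/YY HH:MM:SS (an assumption
# about the log format; adjust if the local logs differ).
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}):\d{2}:\d{2} ")


def tally_failures(lines):
    """Count failure messages per (date, hour) bucket."""
    counts = Counter()
    last_bucket = None
    for line in lines:
        m = STAMP.match(line)
        if m:
            # Remember the most recent timestamp so that a message wrapped
            # onto a continuation line is still attributed to it.
            last_bucket = (m.group(1), m.group(2))
        if FAILURE in line and last_bucket is not None:
            counts[last_bucket] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: pass one or more Starter log paths as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for (date, hour), n in sorted(tally_failures(fh).items()):
                print(f"{path} {date} {hour}:00 {n}")
```

Running this against the Starter log of each execute system would show
whether one host or one time window accounts for most of the failures.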
Does anyone have any idea what's going on, or any suggestions on how
to debug this?
Thanks