We have a Linux Condor pool that runs vanilla universe jobs
on a file system shared across several servers. We are using Condor version
6.7.20. Occasionally a user will submit a job that starts, gets
evicted, and makes several attempts on one machine before successfully starting on
another machine. In almost all of these cases the cause is a "Permission denied" error
on the output files.

2/15 17:28:08 (fd:13) (pid:23771) Starting a VANILLA universe job with ID: 16935.0
2/15 17:28:08 (fd:13) (pid:23771) In OsProc::OsProc()
2/15 17:28:08 (fd:13) (pid:23771)
2/15 17:28:08 (fd:13) (pid:23771)
2/15 17:28:08 (fd:13) (pid:23771)
2/15 17:28:08 (fd:13) (pid:23771) in VanillaProc::StartJob()
2/15 17:28:08 (fd:13) (pid:23771) in OsProc::StartJob()
2/15 17:28:08 (fd:13) (pid:23771) IWD: /work/mb_apps/fi/temp.mb.301.35/mbpa-3.0.1/JBoss-2.4.3_Tomcat-3.2.3/jboss/bin
2/15 17:28:08 (fd:13) (pid:23771) PRIV_CONDOR --> PRIV_USER at os_proc.C:232
2/15 17:28:08 (fd:14) (pid:23771) Input file: /dev/null
2/15 17:28:20 (fd:14) (pid:23771) Failed to open '/work/pre3/fes12/RetroDevelopment/repository/projects/Master/jobs/work/fes13modelEval-3/logs/.condor/condor.out' as standard output: Permission denied (errno 13)
2/15 17:28:20 (fd:14) (pid:23771) Doing CONDOR_ulog
2/15 17:28:20 (fd:14) (pid:23771) Failed to open '/work/pre3/fes12/RetroDevelopment/repository/projects/Master/jobs/work/fes13modelEval-3/logs/.condor/condor.err' as standard error: Permission denied (errno 13)
2/15 17:28:20 (fd:14) (pid:23771) Doing CONDOR_ulog
2/15 17:28:20 (fd:13) (pid:23771) Failed to open some/all of the std files...
2/15 17:28:20 (fd:13) (pid:23771) Aborting OsProc::StartJob.
2/15 17:28:20 (fd:13) (pid:23771) PRIV_USER --> PRIV_CONDOR at os_proc.C:257
2/15 17:28:20 (fd:13) (pid:23771) Failed to start job, exiting
2/15 17:28:20 (fd:13) (pid:23771) ShutdownFast all jobs.
2/15 17:28:20 (fd:13) (pid:23771) Got ShutdownFast when no jobs running.

These files are created on the shared system by Condor, and
when the user logs on they are able to modify the files themselves. Furthermore, Condor can generally write the log as soon as
it tries a different machine. (The machine that produces the Permission denied
error and the machine on which the job finally runs both change from run to run.) The error occurs seemingly at random: the user can run
several similar jobs and only a small subset will have this problem.

Can anyone suggest what I should look at or do to better
understand why these Permission denied errors are occurring? Is there any information I didn’t include in this
email that could help you out?

Thanks,