[Condor-users] A couple of MPI-universe oddities
- Date: Tue, 03 May 2005 12:01:46 +0100
- From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
- Subject: [Condor-users] A couple of MPI-universe oddities
Hi,
I'm experiencing a couple of problems running MPI universe jobs,
depending on whether I try to use the underlying NFS file system (which
gives rise to one error) or not (which leads to a different error). I'll
describe the two separately:
1) Problem 1 - No NFS case
In this case I set the following in the submit script (a cut-down version
of the full submit file is shown further down):
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
and all goes well with a simple "hello world" program for jobs that use
<= 6 processors. Any more than that and some processors do not return
their output. It's not always the same processors, and not always the
same number of processors. The ShadowLog always has:
5/3 11:43:41 (25.0) (30724): Job 25.0 terminated: exited with status 0
5/3 11:43:42 (25.0) (30724): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
The StarterLogs all exit with "Status 0", and the job logfile confirms
that some nodes returned zero bytes. I stress that the raw MPI job, when
run directly with mpirun, works perfectly well for an arbitrary number of nodes.
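For reference, a stripped-down version of the submit file I'm using for this
case looks roughly like the following; the executable name, machine count and
log file name are placeholders rather than the exact values I use:

# MPI universe job, output transferred back on exit
universe = MPI
executable = hello_mpi
machine_count = 8
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# per-node stdout/stderr via the $(NODE) macro
output = outfile.$(NODE)
error = errfile.$(NODE)
log = mpi_test.log
queue

It's the per-node outfile.$(NODE) files that come back empty from some nodes
when machine_count goes above 6.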
2) Problem 2 - NFS case
Now I set the following in the submit script (again, a cut-down submit file
is shown further down):
should_transfer_files = IF_NEEDED
and for any number of processors in the job I get the following in the
job logfile:
007 (026.000.000) 05/03 11:51:08 Shadow exception!
    Error from starter on node2--srl.grid.private.cam.ac.uk: Failed to open standard output file '/home/mcal00/mpi/outfile.0': Permission denied (errno 13)
I have no problem running the jobs via mpirun as the dedicated condor
execute user. I have, however, come across an article suggesting that one
possible source of the "Permission denied" message is using the su command
to change the effective user id on systems that use the ch_p4 device. This
is pretty much the Condor-MPI setup, right? /home is NFS-exported with
no_root_squash across the nodes, and both root and the dedicated condor
user have passwordless access set up.
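For completeness, the submit file for this case is roughly the following
(again the executable name and machine count are placeholders; the output
path is the one that appears in the error above):

# MPI universe job, relying on the shared NFS /home instead of file transfer
universe = MPI
executable = hello_mpi
machine_count = 8
should_transfer_files = IF_NEEDED
output = /home/mcal00/mpi/outfile.$(NODE)
error = /home/mcal00/mpi/errfile.$(NODE)
log = /home/mcal00/mpi/mpi_test.log
queue

So the starter on node2 is trying to open that output path directly on the
NFS mount, and that is where the errno 13 shows up.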
Help with either of the above problems would be much appreciated!
Cheers,
Mark