Re: [Condor-users] Parallel submission issue
- Date: Fri, 31 Aug 2007 17:23:53 +0200
- From: Nicolas GUIOT <nicolas.guiot@xxxxxxx>
- Subject: Re: [Condor-users] Parallel submission issue
Please people, I REALLY need help: I'm leaving this lab very soon, and if I can't get this working for MPI, people will almost certainly give up using Condor, even for single-CPU jobs, which would be very sad...
News:
I tested MPI with another program, and I see exactly the same symptoms: one computer stores the output files, and each process running on that computer finds them, but the second computer can't find them.
As a test, I would like to put the EXECUTE directory on the same NFS share, so I tried this:
condor_config_val -rset EXECUTE=/nfs/scratch-condor/execute
but it failed, whether I ran it as root or as the condor user:
Attempt to set configuration "EXECUTE=/nfs/scratch-condor/execute" on master calisto.my.domain.fr <XXX.XXX.XXX.XXX:55829> failed.
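By default, condor_config_val -rset is refused unless the target daemon allows runtime configuration and grants CONFIG-level access to the calling host. A minimal sketch of the knobs involved, assuming a Condor 6.x-style setup (the file location and the choice of allowed host are assumptions for this pool):

# in the local config file on calisto (e.g. condor_config.local)
ENABLE_RUNTIME_CONFIG = TRUE
# which attributes may be changed remotely at the CONFIG access level
SETTABLE_ATTRS_CONFIG = EXECUTE
# which hosts get CONFIG access; here, the central manager (an assumption)
HOSTALLOW_CONFIG = $(CONDOR_HOST)

# then, on calisto, pick up the new permissions:
$ condor_reconfig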
So:
1- what is the correct way to make the files visible to all the computers?
2- for my tests, how can I change the EXECUTE directory to an NFS-shared one? (See the sketch just below.)
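For question 2, a simpler route than -rset is to set EXECUTE in each execute node's local configuration file and reconfigure; a minimal sketch (the exact config file path depends on the install):

# on each execute node, in its local config file
EXECUTE = /nfs/scratch-condor/execute/$(HOSTNAME)

$ condor_reconfig

The $(HOSTNAME) suffix is deliberate: the startd creates per-job dir_<pid> subdirectories under EXECUTE, and PIDs on different machines can collide if they all share a single NFS path.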
++
Nicolas
----------------
On Thu, 30 Aug 2007 12:36:08 +0200
Nicolas GUIOT wrote:
> Hi
>
> I'm trying to submit an MPI job to my condor pool.
>
> The problem is that when I ask it to run on 2 CPUs (i.e. 1 computer), it's fine, but when I ask for 4 CPUs (i.e. 2 computers), one of them seems unable to find the file it should write the output to.
>
> Here is the submit description file:
> $ cat sub-cond.cmd
> universe = parallel
> executable = mp2script
> arguments = /nfs/opt/amber/amber9/exe/sander.MPI -O -i md.in -o TGA07.1.out -p TGA07.top -c TGA07.0.rst -r TGA07.1.rst -x TGA07.1.trj -e TGA07.1.ene
> machine_count = 4
> should_transfer_files = yes
> when_to_transfer_output = on_exit_OR_EVICT
> transfer_input_files = /nfs/opt/amber/amber9/exe/sander.MPI,md.in,TGA07.top,TGA07.0.rst
> Output = sanderMPI.out
> Error = sanderMPI.err
> Log = sanderMPI.log
> queue
>
> I'm submitting the job from a directory that is NFS-shared:
>
> (/nfs/test-space/amber)$ ls
> blu.sh clean.sh md.in mdinfo mp2script mpd.hosts run_MD.sh sub-cond.cmd TGA07.0.rst TGA07.top
>
> The error is a typical Amber error for when it can't open the result file (TGA07.1.out is an output file; it doesn't exist before running the program):
>
> $ more sanderMPI.err
> 0:
> 0: Unit 6 Error on OPEN: TGA07.1.out
>
> 0: [cli_0]: aborting job:
> 0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> $
>
> So, where is my problem? NFS? File transfer?
>
> Any help would be greatly appreciated :)
>
> Nicolas
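
Since the submit directory is already NFS-mounted on every node, one common alternative to Condor's file transfer is to run the whole job in the shared directory. A hedged rewrite of the quoted submit file along those lines (it assumes the execute nodes mount /nfs/test-space/amber and share a FILESYSTEM_DOMAIN with the submit machine):

universe = parallel
executable = mp2script
arguments = /nfs/opt/amber/amber9/exe/sander.MPI -O -i md.in -o TGA07.1.out -p TGA07.top -c TGA07.0.rst -r TGA07.1.rst -x TGA07.1.trj -e TGA07.1.ene
machine_count = 4
initialdir = /nfs/test-space/amber
should_transfer_files = NO
Output = sanderMPI.out
Error = sanderMPI.err
Log = sanderMPI.log
queue

With file transfer disabled, every rank starts in initialdir, so all of them see the same TGA07.1.out; with transfer enabled, each node gets its own scratch copy of the inputs, and files created on one node are invisible to the other.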
----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE
Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------