Re: [Condor-users] Parallel submission issue
- Date: Wed, 5 Sep 2007 15:34:27 +0200
- From: Nicolas GUIOT <nicolas.guiot@xxxxxxx>
- Subject: Re: [Condor-users] Parallel submission issue
OK, here is an update:
I edited the local config file to put EXECUTE on an NFS-shared directory: now I can run MPICH2 programs across 2 different boxes.
BUT this is not an efficient solution: everything will be written over the network, which will slow down my simulations...
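For reference, the configuration change described above could look roughly like the sketch below. It reuses the NFS path from the earlier condor_config_val attempt quoted further down (/nfs/scratch-condor/execute) and relies on the Condor rule that a later definition of a macro overrides an earlier one, so appending a line is enough. The helper name set_execute_dir and the config file path in the example call are made up for illustration, and the daemons still have to re-read their configuration afterwards (condor_reconfig, or a restart if that turns out not to be sufficient).

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): append an EXECUTE setting to a
# Condor local configuration file.
# Usage: set_execute_dir <local-config-file> <execute-dir>
set_execute_dir() {
    conf_file=$1
    execute_dir=$2
    # When a macro is defined more than once, the last definition wins, so
    # appending is enough and the previous line stays in place for rollback.
    printf 'EXECUTE = %s\n' "$execute_dir" >> "$conf_file"
}

# Example call, to be run on every execute machine (paths are assumptions):
#   set_execute_dir /path/to/condor_config.local /nfs/scratch-condor/execute
#   condor_reconfig
```

Appending rather than editing in place keeps the original local-disk setting visible in the file, which makes it easy to switch back once the test is over.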
On the other hand, I tried some LAM/MPI jobs: they run perfectly well on several boxes, even with the initial setup (EXECUTE on each machine's local disk).
So I can now pinpoint the problem more precisely:
When I run programs with MPICH _or_ MPICH2 on several computers with EXECUTE located on local disk, the job fails because one box can't find the "main" dir_XXXXX in which to write the output.
If I change EXECUTE to an NFS-shared directory, the job runs fine (but this is wasteful of network resources, IMO).
Programs that use LAM/MPI run fine on several boxes with a local EXECUTE.
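One way to narrow this down would be to run a tiny diagnostic in place of the real program and compare what each node reports. The sketch below is only an illustration: report_cwd is a made-up name, and it assumes a POSIX shell and the hostname command on every box. It prints the host name, the directory it was asked about (the current working directory by default) and whether that directory is missing, read-only or writable.

```shell
#!/bin/sh
# Hypothetical diagnostic (not part of Condor or MPICH): report whether a
# directory exists and is writable on the machine where this runs.
# Usage: report_cwd [directory]    (defaults to the current directory)
report_cwd() {
    dir=${1:-$PWD}
    if [ ! -d "$dir" ]; then
        state=missing
    elif [ -w "$dir" ]; then
        state=writable
    else
        state=read-only
    fi
    # One line per process: "<host> <directory> <state>"
    printf '%s %s %s\n' "$(hostname)" "$dir" "$state"
}
```

Saved as a script that ends with a call to report_cwd and launched through the same MPI start-up path as the failing job, the lines from each box show whether the remote processes start in the same dir_XXXXX scratch directory as the first box or somewhere else.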
What should I investigate now? Do you have a solution for this?
Thanks in advance
Nicolas
----------------
On Sat, 1 Sep 2007 08:40:07 +0100
Si Hammond wrote:
>
> On 31 Aug 2007, at 16:23, Nicolas GUIOT wrote:
>
> > Please people, I REALLY need help : I'm leaving this lab very soon,
> > and if I can't get this to work for MPI, it's quite sure people
> > will give up using condor, even for mono-cpu jobs, which would be
> > very sad...
> >
> > News :
> >
> > I tested MPI with another program, and I get exactly the same
> > symptoms : 1 computer stores the output files, and each process
> > that runs on this computer finds them, but the 2nd computer
> > can't find them.
> >
> > I would like to make a test and put the EXECUTE directory on the
> > same nfs folder. So, I tried to do this :
> > condor_config_val -rset EXECUTE=/nfs/scratch-condor/execute
> >
> > but it failed, whether I ran it as root or as the condor user :
> > Attempt to set configuration "EXECUTE=/nfs/scratch-condor/execute"
> > on master calisto.my.domain.fr <XXX.XXX.XXX.XXX:55829> failed.
> >
> > So :
> > 1- what's the correct way to make the files visible to all the
> > computers?
> > 2- for my tests, how can I change the EXECUTE directory to be
> > NFS-shared?
>
> Nicolas, have you tried specifying EXECUTE in each machine's
> configuration file (i.e. making every machine use the NFS-shared space)?
>
>
> >
> >
> > ++
> > Nicolas
> >
> > ----------------
> > On Thu, 30 Aug 2007 12:36:08 +0200
> > Nicolas GUIOT wrote:
> >
> >> Hi
> >>
> >> I'm trying to submit an MPI job to my condor pool.
> >>
> >> The problem is that when I ask it to run on 2 CPUs (i.e. 1
> >> computer), it's fine, but when I ask for 4 CPUs (i.e. 2 computers),
> >> one of them seems unable to find the file to write the output to.
> >>
> >> Here is the submission script :
> >> $ cat sub-cond.cmd
> >> universe = parallel
> >> executable = mp2script
> >> arguments = /nfs/opt/amber/amber9/exe/sander.MPI -O -i md.in -o TGA07.1.out -p TGA07.top -c TGA07.0.rst -r TGA07.1.rst -x TGA07.1.trj -e TGA07.1.ene
> >> machine_count = 4
> >> should_transfer_files = yes
> >> when_to_transfer_output = on_exit_OR_EVICT
> >> transfer_input_files = /nfs/opt/amber/amber9/exe/sander.MPI,md.in,TGA07.top,TGA07.0.rst
> >> Output = sanderMPI.out
> >> Error = sanderMPI.err
> >> Log = sanderMPI.log
> >> queue
> >>
> >> I'm starting the script from a directory that is nfs-shared :
> >>
> >> (/nfs/test-space/amber)$ ls
> >> blu.sh clean.sh md.in mdinfo mp2script mpd.hosts run_MD.sh
> >> sub-cond.cmd TGA07.0.rst TGA07.top
> >>
> >> The error is a typical Amber error when it can't find the result
> >> file (TGA07.1.out is an output file; it doesn't exist before running
> >> the program):
> >>
> >> $ more sanderMPI.err
> >> 0:
> >> 0: Unit 6 Error on OPEN: TGA07.1.out
> >>
> >> 0: [cli_0]: aborting job:
> >> 0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> >> $
> >>
> >> So, where is my problem ? NFS ? file transfer ?
> >>
> >> Any help would be greatly appreciated :)
> >>
> >> Nicolas
> >
> >
> > ----------------------------------------------------
> > CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> >
> > Institut de Biologie Physico-Chimique
> > 13 rue Pierre et Marie Curie
> > 75005 PARIS - FRANCE
> >
> > Tel : +33 158 41 51 70
> > Fax : +33 158 41 50 26
> > ----------------------------------------------------
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> > with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
>