
Re: [Condor-users] MPI jobs in the vanilla universe



On Wed, Oct 27, 2004 at 08:24:14AM -0700, David E. Konerding wrote:
> Erik Paulson wrote:
> 
> >On Tue, Oct 26, 2004 at 03:30:02PM -0700, David E. Konerding wrote:
> > 
> >
> >>Hi,
> >>
> >>I am interested in running an MPI job on my cluster (which is already 
> >>running Condor 6.6.6), but within the vanilla
> >>universe (there are some restrictions to the MPI universe setup which we 
> >>cannot abide by).
> >>
> >
> >The vanilla universe has all of the same restrictions as the MPI universe 
> >(they're nearly identical code-bases) - what is giving you trouble?
> >
> From the manual:
> 
> > Administratively, Condor must be configured such that resources
> > (machines) running MPI jobs are dedicated.
> 
> Not sure what that means, but it sounds to me like we would have to
> statically configure nodes to run MPI jobs, which would be fully
> exclusive of vanilla jobs (following the docs from the user MPI
> section, 2.10, to the admin MPI section, 3.10.10, shows that you have
> to set up a dedicated scheduler that manages dedicated resources).
> We've always used the pool as a combination of MPI and single-process
> jobs, so this is undesirable.
> 

No, that's not what we mean by dedicated - to us, dedicated means "only
runs Condor jobs - not a desktop that will be interrupted by a returning
user". MPI universe jobs are managed such that if any one processor is
lost, we abort the job on all processors - so it's a bad idea to run MPI
universe jobs on machines that might be evicted (you could if you
_really_ wanted to, though).

See the Wright "Cheap cycles from the desktop to the dedicated cluster"
paper at http://www.cs.wisc.edu/condor/publications.html#scheduling -
the whole idea of Condor and MPI is that we can run vanilla jobs on a
"dedicated" MPI cluster.
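
To make that concrete, here is a sketch of the kind of startd policy
section 3.10.10 describes (the hostname is a placeholder, not from your
pool) - the node always runs Condor jobs, but RANK prefers work from the
dedicated scheduler, so vanilla jobs fill in when no MPI work is queued:

    ## condor_config sketch for a dedicated execute node
    ## (hostname below is hypothetical; see section 3.10.10)
    DedicatedScheduler = "DedicatedScheduler@central-mgr.example.com"
    STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

    ## Always willing to start jobs - this is not a desktop...
    START   = True
    SUSPEND = False
    PREEMPT = False
    KILL    = False

    ## ...but prefer the dedicated scheduler's (MPI) jobs over
    ## opportunistic vanilla jobs.
    RANK = Scheduler =?= $(DedicatedScheduler)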

> Also:
> 
> > This leads to a further restriction that jobs submitted to execute
> > under the MPI universe (with dedicated machines) must be submitted
> > from the machine running as the dedicated scheduler.
> 
> We would normally be starting these jobs from a laptop, far away from
> the pool.

OK - one bad thing is that the submit machine needs to stay connected during
the lifetime of the job, so submitting from a laptop isn't always a good
idea.
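
For reference, a minimal MPI universe submit description of the sort
you'd hand to condor_submit on the dedicated scheduler machine (the
executable and file names here are placeholders):

    ## mpi.sub - MPI universe sketch; must be submitted from the
    ## dedicated scheduler host, which stays connected for the
    ## lifetime of the job
    universe      = MPI
    executable    = my_mpi_program
    machine_count = 8

    log    = mpi.log
    output = out.$(NODE)
    error  = err.$(NODE)

    queue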

> That laptop is running Windows, the pool is running Linux.  So 
> this is a constraint we cannot satisfy; we don't want to have to ssh 
> into the pool to start the job.
> 

This is also a problem - you cannot currently cross-submit MPI jobs.
Windows MPI must be submitted from Windows, and Unix MPI must be
submitted from Unix (you can cross-submit between Unix platforms - i.e.,
submit Linux jobs from Solaris).

-Erik