Re: [Condor-users] mpi and dedicated scheduler configuration
- Date: Mon, 28 Jun 2004 06:43:10 -0700 (PDT)
- From: Mike Busch <zenlc2000@xxxxxxxxx>
- Subject: Re: [Condor-users] mpi and dedicated scheduler configuration
Hi Vahid,
Thank you for that report. At this point I'm more than a little
frustrated. Originally the project was to have Globus submitting jobs
to a Condor pool of MPI machines, but I find I can't run Globus due to
security restrictions, and now I find out that Condor won't submit MPI
jobs from Linux to Windows. That pretty effectively kills my project.
I'll work with it a bit and let you know if I get any different
results.
Mike
--- Vahid Pazirandeh <vpaziran@xxxxxxxxx> wrote:
> Hello,
>
> I used to have a setup very similar to yours, Mike. Then I found a bug
> that arises when you use a Linux submitter with Windows execute nodes
> for MPI jobs.
>
> I had a Linux server acting as the central manager and submitter. All
> my execute nodes were Windows. I compiled my code with cygwin. My code
> ran fine when I ran it with the GUI NT mpirun program included in the
> package at http://www-unix.mcs.anl.gov/mpi/mpich/. I could run my MPI
> code on Condor so long as machine_count=1. If it was anything greater,
> Condor would crash and burn.
>
> I sent a bug report to condor-admin around November 2003. It turned out
> to be a larger bug in Condor than they had expected and it is not fixed
> to this day. It is a documented bug. I saw it documented some time in
> 2004: http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html.
>
> -- snippet --
> Condor 6.6.1 Release notes:
> Known Bugs:
> * Submission of MPI jobs from a Unix machine to run on Windows machines
>   (or vice versa) fails for machine_count > 1. This is not a new bug.
>   Cross-platform submission of MPI jobs between Unix and Windows has
>   always had this problem.
> -- snippet --
>
> Now I use a Windows machine as the central manager and submitter. I've
> installed as many UNIX tools as I needed to make the server more
> friendly (cygwin with all its support tools like sshd, etc).
>
> I run MPI jobs successfully now with a Windows submitter. I should also
> point out that I use MPICH NT 1.2.5. I have always used this version
> and I know Condor documentation specifically notes that 1.2.5 is not
> supported. I have not suffered any MPI related problems in my
> all-Windows pool.
>
> However, I have uncovered what I think is a bug in the file transfer
> mechanism when running MPI jobs on a Windows pool. As the number of
> files to transfer (transfer_input_files) and the machine_count value
> rise, the chances of the file transfer failing get very high - to the
> point that you can assume failure. I haven't heard many others talk
> about this, though I don't know how many people are using a Windows
> pool to run MPI jobs like I am. I submitted the bug to condor-admin a
> few months ago but I have not received many replies. The few replies I
> did receive simply stated that they are too busy to read through the
> logs that I sent in. About a month ago I posted the problem to this
> mailing list (dig through the archives and it should pop up).
>
> With all this said, if you successfully run win32 MPI code from a Linux
> server on 2 or more Windows execute nodes, let me know! I'll be very
> interested to know your exact setup. Cheers and good luck.
>
> Regards,
> Vahid
>
>
>
> --- Mike Busch <zenlc2000@xxxxxxxxx> wrote:
> > Erik,
> >
> > You say,
> >
> > > With the vanilla universe, you won't be able to allocate multiple
> > > machines in any sort of a group - you run the risk of a single node
> > > disappearing. With the MPI universe, the loss of a single node
> > > tells Condor to shut down all of the other machines, since Condor
> > > assumes your MPI implementation has no fault tolerance.
> >
> > Let's say I'm willing to accept the lack of fault tolerance just for
> > the sake of proving the concept. Is it possible to submit an NT-MPICH
> > job to a Linux Manager and have it run on a Win2k pool in the vanilla
> > universe?
> >
> > Thanks!
> > Mike
> >
> >
> >
> >
>
>
> =====
> < NPACI Education Center on Computational Science and Engineering >
> < http://www.edcenter.sdsu.edu/>
>
> "A friend is someone who knows the song in your heart and can sing it
> back to you when you have forgotten the words." -Unknown Author
> =====
>
>
>
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
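
For anyone trying to reproduce the setups discussed above, here is a rough Python sketch of the two things in play: the submit description for an MPI universe job, and the known-bug condition quoted from the 6.6.1 release notes (cross-platform Unix/Windows submission with machine_count > 1). The function names, the OS strings, and the example file names are hypothetical and only for illustration; the submit commands shown (universe, executable, machine_count, transfer_input_files, queue) are the ones mentioned in this thread or in the Condor manual, and a real submit file will likely need more settings than this.

```python
def mpi_submit_description(executable, machine_count, input_files=()):
    """Render a minimal MPI-universe submit description as text.

    Hypothetical helper: it only emits the handful of commands discussed
    in this thread, not everything a real job may need.
    """
    lines = [
        "universe = MPI",
        f"executable = {executable}",
        f"machine_count = {machine_count}",
    ]
    if input_files:
        # Vahid reports that failures become more likely as this list
        # and machine_count grow on an all-Windows pool.
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines.append("queue")
    return "\n".join(lines) + "\n"


def hits_cross_platform_mpi_bug(submit_os, execute_os, machine_count):
    """True when a job matches the known bug from the 6.6.1 release notes.

    Assumption: OS names starting with "win" (case-insensitive) mean
    Windows, and anything else is treated as Unix.
    """
    def is_windows(name):
        return name.lower().startswith("win")

    cross_platform = is_windows(submit_os) != is_windows(execute_os)
    return cross_platform and machine_count > 1


if __name__ == "__main__":
    print(mpi_submit_description("my_mpi.exe", 2, ["a.dat", "b.dat"]))
    print(hits_cross_platform_mpi_bug("LINUX", "WINNT50", 2))
```

Under these assumptions, a Linux submitter with Windows execute nodes is flagged only when machine_count is greater than 1, which matches what Vahid saw: machine_count=1 ran, anything greater did not.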