Re: [condor-users] condor-mpich
- Date: Wed, 25 Feb 2004 17:33:48 -0500
- From: Joel Hernandez <joelh@xxxxxxxxx>
- Subject: Re: [condor-users] condor-mpich
Erik Paulson wrote:
> On Fri, Feb 20, 2004 at 03:39:11PM -0500, Joel Hernandez wrote:
>> I've been trying to set up several of our nodes to run as dedicated
>> resources in order to run MPI jobs. I've tested the setup using the
>> simple example in section 2.10 of the online Condor Manual for V6.6.
>> However, instead of containing the printout from stdin, the outfile
>> contains the following error message:
>> rm_3660: (-) net_recv failed for fd = 3
>> rm_3660: p4_error: net_recv read, errno = : 104
>> Has anyone encountered this problem?
> Which version of MPICH? MPICH 1.2.5 changed what it expects the
> P4_RSH_COMMAND to do. Versions 1.2.4 and earlier were OK if the RSH
> command used to start up the job exited before the job completed;
> 1.2.5 considers that an error.
> In Condor, we don't use a real RSH to start up a job, so we have an RSH
> "in name only", and it exits before the job completes. We do plan to
> come up with a workaround, but it'll be a while before that gets into
> Condor.
That's the problem. I installed MPICH 1.2.4 and it works!
However, I now have another problem. We have two eight-node, dual-CPU
clusters (louie and duey). Users submit their jobs on louie, and when
all the nodes are busy, the jobs start to flock and run on duey. This
works great for non-MPI jobs.
I am now able to run the simple example MPI job from section 2.10 of
the Condor manual on either louie or duey. But in order to run an
MPI job on duey, I have to submit the job on duey. If I use the
following requirements expression in a submit file on louie:
requirements = machine == "duey3.cnidr.org"
and submit the job from louie, the job just stays in the queue.
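For context, the submit file looks roughly like the sketch below. It is
modeled on the MPI example in the manual; the executable name, the file
names, and the machine count are placeholders rather than the exact
values in use, and only the requirements line is the one quoted above.

    # Sketch of an MPI universe submit file, submitted from louie.
    # Executable, file names, and machine_count are placeholders.
    universe      = MPI
    executable    = simplempi
    log           = logfile
    input         = infile.$(NODE)
    output        = outfile.$(NODE)
    error         = errfile.$(NODE)
    machine_count = 4
    # Restrict the job to a single named node on the duey cluster.
    requirements  = machine == "duey3.cnidr.org"
    queue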
Any ideas?
Thanks,
Joel
-----------------------------------------------------------------
Joel Hernandez
Systems Programmer / Analyst
MCNC-RDI Center for Networked Information Discovery and Retrieval
joelh@xxxxxxxxx
http://www.cnidr.org