[Condor-users] How to troubleshoot MPI job


Date: Tue, 15 Feb 2005 17:04:45 +0800
From: Nigel Teow <nigelt@xxxxxxxxxxxxxxxxx>
Subject: [Condor-users] How to troubleshoot MPI job
Hi,

I have installed Condor (version 6.6.8) on a cluster.

I am able to use condor_submit to run the MPI job on a single node, but when I try to run it on 2 nodes, it fails. The output files are as follows:

outfile.0
-----------
p0_28434: p4_error: Child process exited while making connection to remote process on compute-0-1.local: 0
p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32


outfile.1
-----------
rm_28438: (-) net_recv failed for fd = 3
rm_28438:  p4_error: net_recv read, errno = : 104
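
For reference, the errno values in these messages can be decoded with a few lines of Python. This is only a minimal sketch, assuming the nodes run Linux, where 32 is EPIPE and 104 is ECONNRESET (other platforms number these errors differently). The table and the describe_errno helper are illustrative names, not part of Condor or MPICH.

```python
import os

# Linux errno numbers seen in the output above.
# Assumption: the cluster nodes run Linux; the numbering is platform-specific.
LINUX_ERRNO_NAMES = {
    32: "EPIPE",        # write to a pipe or socket whose peer has gone away
    104: "ECONNRESET",  # connection reset by peer
}


def describe_errno(code):
    """Return 'NAME: message' for an errno number (hypothetical helper)."""
    name = LINUX_ERRNO_NAMES.get(code, "UNKNOWN")
    # os.strerror returns the local platform's message text for the number.
    return f"{name}: {os.strerror(code)}"


for code in (32, 104):
    print(code, describe_errno(code))
```

Both codes suggest that the process at the other end of the connection went away, which would be consistent with the "Child process exited" message in outfile.0.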

The submit description file I used is as follows:

##################################
# Condor submit description file
##################################
universe = MPI
executable = hello-world
log = logfile
input = /dev/null
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
queue

I am able to run the hello-world program directly using mpirun.

I would appreciate it if anyone could advise on how I could troubleshoot this. Thanks in advance.

Nigel

--
Nigel Teow
Bioinformatics Institute

