Date: | Tue, 15 Feb 2005 17:04:45 +0800 |
---|---|
From: | Nigel Teow <nigelt@xxxxxxxxxxxxxxxxx> |
Subject: | [Condor-users] How to troubleshoot MPI job |
Hi,

I have installed Condor (version 6.6.8) on a cluster. I am able to use condor_submit to run an MPI job on a single node, but when I try to run it on 2 nodes, it fails. The output files are as follows.

outfile.0
-----------
p0_28434: p4_error: Child process exited while making connection to remote process on compute-0-1.local: 0
p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32

outfile.1
-----------
rm_28438: (-) net_recv failed for fd = 3
rm_28438: p4_error: net_recv read, errno = : 104

This is the submit description file I used:

##################################
# Condor submit description file
##################################
universe = MPI
executable = hello-world
log = logfile
input = /dev/null
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
queue

I am able to run the hello-world program directly using mpirun. I would appreciate any advice on how to troubleshoot this. Thanks in advance.

Nigel

--
Nigel Teow
Bioinformatics Institute
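As a first troubleshooting step, the errno values in the output files can be decoded into their symbolic names. The following is only a minimal sketch, assuming the compute nodes run Linux (where 32 is EPIPE and 104 is ECONNRESET). The lookup table and the function name decode_errnos are illustrative and are not part of Condor or MPICH.

```python
import re

# Assumption: Linux errno numbering. Only the two codes seen in the
# output files are listed; other platforms may number them differently.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches both "errno = 32" and the odd "errno = : 104" form.
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def decode_errnos(lines):
    """Return (code, name, description) for each line that carries an errno."""
    results = []
    for line in lines:
        match = ERRNO_RE.search(line)
        if match:
            code = int(match.group(1))
            name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
            results.append((code, name, text))
    return results


# The lines from outfile.0 and outfile.1, copied from the message.
sample = [
    "p0_28434: p4_error: Child process exited while making connection "
    "to remote process on compute-0-1.local: 0",
    "p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32",
    "rm_28438: (-) net_recv failed for fd = 3",
    "rm_28438: p4_error: net_recv read, errno = : 104",
]

for code, name, text in decode_errnos(sample):
    print(f"errno {code}: {name} ({text})")
```

Both codes indicate that the peer closed or reset the TCP connection, which suggests the remote process on the second node exited early rather than a problem in the hello-world program itself.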