[Condor-users] Condor and MPI jobs
- Date: Sun, 24 May 2009 23:30:08 -0300
- From: Ary Junior <aryjunior@xxxxxxxxx>
- Subject: [Condor-users] Condor and MPI jobs
Hi, I'm trying to run an MPI job under Condor. My .submit file looks like this:
universe = vanilla
requirements = Activity == "Idle"
executable = LIME-443-001.sh
output = LIME-443-001.sh.out
error = LIME-443-001.sh.err
log = LIME-443-001.sh.log
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
In this example, LIME-443-001.sh has the following content:
#!/bin/sh
export OMP_NUM_THREADS=1
export LD_LIBRARY_PATH=:/usr/lib64/mpi/gcc/openmpi/lib64
/usr/lib64/mpi/gcc/openmpi/bin/mpirun -np 2 /opt/espresso-mpi/bin/pw.x < /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.inp > /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.out
If I bypass Condor and run the script directly with "sh LIME-443-001.sh", everything works fine. However, when I run "condor_submit LIME-443-001.submit", I get the following error in the LIME-443-001.sh.err file:
[xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 312
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_pls_base_select failed
--> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
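Since the same script works interactively but fails under Condor, one way to narrow this down is to compare the two environments. The following is only a diagnostic sketch, not a confirmed fix: the function name missing_vars and the file names interactive.env and condor.env are hypothetical, and it assumes each file was produced by running "env > file", once in the login shell and once at the top of LIME-443-001.sh when run by Condor.

```sh
#!/bin/sh
# Hypothetical helper: print the names of variables that appear in the
# first `env` dump but are absent from the second.
# Usage: missing_vars interactive.env condor.env
missing_vars() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    # Keep only the NAME part of NAME=value lines, sorted for comm(1).
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u > "$a"
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$2" | cut -d= -f1 | sort -u > "$b"
    # comm -23 prints only the lines unique to the first file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running "missing_vars interactive.env condor.env" lists what the interactive shell has and the Condor job lacks. If variables such as PATH or HOME show up there, the submit-file commands getenv and environment are the usual places to look. Whether that explains this particular orte_init failure is an assumption.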
Can anybody help me?
Thanks very much!!!
Ary Junior