Re: [Condor-users] Condor and MPI jobs
It has been a while since I tried running MPI jobs under Condor, but you
definitely need to use the 'parallel' universe. Also, your executable
shell script is a bit too simple. See this thread:
https://lists.cs.wisc.edu/archive/condor-users/2009-February/msg00024.shtml
It is all somewhat RTFM, I guess, but then again the manual is very
complex and had not been updated for MPI the last time I checked, which
was some time ago, around Condor 6.8.
Jakob
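
To make the 'parallel' universe suggestion concrete, a submit description
might look roughly like the sketch below. It is untested and rests on
assumptions: mpi-wrapper.sh is a hypothetical name for a wrapper script
(Condor ships example MPI wrapper scripts that can serve as a starting
point), the output file names are illustrative, and machine_count = 2 simply
mirrors the "-np 2" in the original script. The pool must also have a
dedicated scheduler configured before parallel universe jobs will run.

```
# Sketch only: parallel universe instead of vanilla
universe = parallel

# Hypothetical wrapper script; it, not this file, is expected to start mpirun
# and to handle the input/output redirection for pw.x
executable = mpi-wrapper.sh
arguments = /opt/espresso-mpi/bin/pw.x

# Number of slots to claim; assumed to match "-np 2" from the original script
machine_count = 2

# $(NODE) expands to the node number, giving one file per node
output = LIME-443-001.out.$(NODE)
error = LIME-443-001.err.$(NODE)
log = LIME-443-001.log

should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
```

Inside the wrapper, Condor sets _CONDOR_PROCNO and _CONDOR_NPROCS in the
environment of each node, which is how a wrapper can tell which node it is
running on and how many nodes were allocated.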
Ary Junior wrote:
> Hi, I'm trying to run a job with MPI and Condor. My .submit file looks
> like this:
>
> universe = vanilla
> requirements = Activity == "Idle"
> executable = LIME-443-001.sh
> output = LIME-443-001.sh.out
> error = LIME-443-001.sh.err
> log = LIME-443-001.sh.log
> should_transfer_files = IF_NEEDED
> when_to_transfer_output = ON_EXIT
> queue
>
> In this example, LIME-443-001.sh has the following content:
>
> #!/bin/sh
> export OMP_NUM_THREADS=1
> export LD_LIBRARY_PATH=:/usr/lib64/mpi/gcc/openmpi/lib64
> /usr/lib64/mpi/gcc/openmpi/bin/mpirun -np 2 /opt/espresso-mpi/bin/pw.x \
>     < /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.inp \
>     > /home/aryjr/SUPERFICIES/LIME/LIME-443-001.pw.out
>
> If I don't use Condor and run the script directly with "sh
> LIME-443-001.sh", everything works fine. However, if I run
> "condor_submit LIME-443-001.submit", I get the following error in the
> LIME-443-001.sh.err file:
>
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 312
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_pls_base_select failed
> --> Returned value -1 instead of ORTE_SUCCESS
>
> --------------------------------------------------------------------------
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 42
> [xeonquad01:22365] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> --------------------------------------------------------------------------
> Open RTE was unable to initialize properly. The error occured while
> attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
> Can anybody help me?
>
> Thanks very much!!!
>
> Ary Junior