
[HTCondor-users] Parallel universe - Multi-hosts MPI UCX not working



Hello all,

 

Open MPI 4.x does not work for me when a parallel universe job requests more than one host. Has anyone succeeded in using the ucx PML with HTCondor's openmpiscript wrapper example?
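For context, the job is submitted with the stock parallel-universe setup, roughly like the following submit description (the solver name and the counts are placeholders, not my actual values):

```
universe       = parallel
executable     = openmpiscript
arguments      = ./solver          # placeholder application binary
machine_count  = 2                 # requesting more than 1 host triggers the failure
request_cpus   = 32
queue
```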

 

$CondorVersion: 9.0.15 Jul 20 2022 BuildID: 597761 PackageID: 9.0.15-1 $

$CondorPlatform: x86_64_Rocky8 $

 

Both OpenFOAM and SU2 are compiled against Open MPI 4.1.4. With the mpirun MCA arguments “--mca btl ^openib --mca pml ucx --mca plm rsh”, parallel jobs running on a single host complete successfully. If more than one host is requested, UCX emits memory allocation errors and the job fails immediately. This is on an InfiniBand fabric.
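The failing multi-host invocation looks roughly like this (the binary name and process count are placeholders; under HTCondor the actual launch is driven by openmpiscript):

```shell
# Fails across hosts with OMPI 4.1.4 + UCX under HTCondor:
mpirun --mca btl ^openib --mca pml ucx --mca plm rsh \
       -np 64 ./solver
```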

 

 

[1668707979.493116] [compute1:22247:0]       ib_iface.c:966  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory

…. X number of cores

[compute1:22247] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309  Error: Failed to create UCP worker

--------------------------------------------------------------------------

No components were able to be opened in the pml framework.

 

This typically means that either no components of this type were

installed, or none of the installed components can be loaded.

Sometimes this means that shared libraries required by these

components are unable to be found/loaded.

 

  Host:      compute1

  Framework: pml

--------------------------------------------------------------------------

[compute1:22247] PML ucx cannot be selected

[compute1:22236] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309  Error: Failed to create UCP worker

…. X number of cores

[compute2:22723] 31 more processes have sent help message help-mca-base.txt / find-available:none found

[compute2:22723] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

 

 

If I instead use Open MPI 3.1.6 with the arguments “--mca btl openib,self --mca plm rsh”, multi-host jobs complete successfully.

Also, if I run the same mpirun command with “--mca btl ^openib --mca pml ucx” outside HTCondor (no schedd, directly on the cluster using passwordless ssh), multi-host jobs also work.
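In other words, both of the following work for me (binary name, process count, and host file are placeholders):

```shell
# 1) OMPI 3.1.6 under HTCondor, forcing the deprecated openib BTL:
mpirun --mca btl openib,self --mca plm rsh -np 64 ./solver

# 2) OMPI 4.1.4 with UCX, launched outside HTCondor over passwordless ssh:
mpirun --mca btl ^openib --mca pml ucx --hostfile hosts -np 64 ./solver
```

Only the combination of UCX + multi-host + HTCondor fails.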

 

Since openib is already deprecated in OMPI 4.x and scheduled for removal in 5.x (https://www.open-mpi.org/faq/?category=openfabrics#openfabrics-default-stack), I’d prefer to find a way to make UCX work within HTCondor rather than trying to make openib work again with OMPI 4+.

Using OMPI 3.1.6 is still viable for now, but I’m guessing we’ll eventually hit an OS or application version that simply won’t work with old Open MPI versions.

 

My wild guess is that it has something to do with “orted_launcher.sh / get_orted_cmd.sh / condor_chirp” and UCX not working together properly, but this is beyond my understanding at this point.

 

Any clues would be appreciated. Thanks!

 

Martin