Sorry about the necro-ing, but the same problem has been reappearing since my last post. I finally found what may have been the real problem all along.

On our clusters, since we're using MPI, we have to set memlock to unlimited on the compute nodes. This was done successfully system-wide, but for some reason condor was falling back to the default of 64 instead of unlimited (running "ulimit -a" in the condor job's wrapper showed this behavior).

It turns out we had already fixed this long ago, but the wrong way: I was editing the /usr/lib/systemd/system/condor.service file directly, instead of creating a new drop-in at /etc/systemd/system/condor.service.d/condor.conf and adding:

[Service]
LimitMEMLOCK=infinity

So, every time we updated condor, condor.service was probably being overwritten and losing the LimitMEMLOCK setting, until I reinstalled everything from scratch again. This setting is required to launch an MPI job across 2 or more compute servers with condor.

Martin

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jason Patton via HTCondor-users

Martin,

Nice sleuthing! While the openmpiscript we provide seems to work pretty well for the heterogeneous pool case that would just be using tcp-over-ethernet (slow, but gets the job done if you absolutely need MPI), maybe we can add some caveats to the manual about using it and/or OpenMPI in an HPC (or at least more HPC-like) environment where you might want to use a faster network and/or squeeze out some extra performance. Thanks for passing along your findings!

Jason

On Fri, Dec 16, 2022 at 11:11 AM Beaumont, Martin <Martin.Beaumont@xxxxxxxxxxxxxxx> wrote:
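For anyone hitting the same issue, the drop-in approach Martin describes can be sketched like this. The file path and setting come from his message; the comments and the "systemctl" verification commands are standard systemd usage, not something specific to the condor packages:

```ini
# /etc/systemd/system/condor.service.d/condor.conf
# Drop-in override for the packaged condor.service.
# Unlike edits made directly to /usr/lib/systemd/system/condor.service,
# files under /etc/systemd/system/<unit>.d/ survive package upgrades.
[Service]
# Raise the locked-memory limit so MPI jobs (e.g. over RDMA-capable
# networks) are not capped at the 64 KiB default.
LimitMEMLOCK=infinity
```

After creating the file, run "systemctl daemon-reload" and restart condor, then confirm the override took effect with "systemctl show condor -p LimitMEMLOCK" or by checking "ulimit -l" inside a job wrapper, as Martin did.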