Sorry about the necro-ing, but the same problem has been reappearing since my last post. I finally found what may have been the real problem all along.

On our clusters, since we're using MPI, we have to set memlock to unlimited on the compute nodes. This was done successfully system-wide, but for some reason condor was falling back to the default of 64 instead of unlimited (running "ulimit -a" in the condor job's wrapper showed this behavior).

It turns out we had already fixed this long ago, but the wrong way: I was editing the /usr/lib/systemd/system/condor.service file directly, instead of creating a new drop-in at /etc/systemd/system/condor.service.d/condor.conf and adding:

[Service]
LimitMEMLOCK=infinity

So, every time we updated condor, condor.service was probably being overwritten and losing the LimitMEMLOCK setting, until I reinstalled everything from scratch again. This setting is required to launch an MPI job across 2 or more compute servers with condor.

Martin

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jason Patton via HTCondor-users

Martin,

Nice sleuthing! While the openmpiscript we provide seems to work pretty well for the heterogeneous pool case that would just be using tcp-over-ethernet (slow, but gets the job done if you absolutely need MPI), maybe we can add some caveats to the manual about using it and/or OpenMPI in an HPC (or at least more HPC-like) environment where you might want to use a faster network and/or squeeze out some extra performance. Thanks for passing along your findings!

Jason

On Fri, Dec 16, 2022 at 11:11 AM Beaumont, Martin <Martin.Beaumont@xxxxxxxxxxxxxxx> wrote:
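For anyone hitting the same issue, the drop-in approach Martin describes can be sketched like this. The file path and setting come from his message; the comments and the "systemctl" verification commands are standard systemd usage, not something specific to the condor packages:

```ini
# /etc/systemd/system/condor.service.d/condor.conf
# Drop-in override for the packaged condor.service.
# Unlike edits made directly to /usr/lib/systemd/system/condor.service,
# files under /etc/systemd/system/<unit>.d/ survive package upgrades.
[Service]
# Raise the locked-memory limit so MPI jobs (e.g. over RDMA-capable
# networks) are not capped at the 64 KiB default.
LimitMEMLOCK=infinity
```

After creating the file, run "systemctl daemon-reload" and restart condor, then confirm the override took effect with "systemctl show condor -p LimitMEMLOCK" or by checking "ulimit -l" inside a job wrapper, as Martin did.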