
Re: [HTCondor-users] The dedicated scheduler use in production systems



We run the dedicated scheduler here at Western Washington University. Using InfiniBand with OpenMPI requires a small modification to the systemd service (LimitMEMLOCK=infinity, so that memory pinning for RDMA succeeds), and the openmpiscript wrapper needs to adjust the ulimit as well, but it works fine for those running multi-node jobs.
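As a rough illustration of the memlock point above, here is a minimal Python sketch for checking what locked-memory limit a process actually sees, for example when run from inside a job wrapper. It only uses the standard-library resource module (Unix only); the helper names are made up for this example and are not part of HTCondor or OpenMPI.

```python
import resource


def memlock_is_unlimited(soft_limit: int) -> bool:
    """Return True if the given soft RLIMIT_MEMLOCK value means 'unlimited'."""
    return soft_limit == resource.RLIM_INFINITY


def describe_memlock(soft_limit: int) -> str:
    """Render a soft RLIMIT_MEMLOCK value (in bytes) as readable text."""
    if memlock_is_unlimited(soft_limit):
        return "unlimited"
    return f"{soft_limit} bytes"


if __name__ == "__main__":
    # getrlimit returns a (soft, hard) tuple for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock soft limit:", describe_memlock(soft))
    if not memlock_is_unlimited(soft):
        print("RDMA memory pinning may fail under this limit")
```

If this reports anything other than "unlimited" inside a job, the service limit or the wrapper's ulimit adjustment has probably not taken effect for that process.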

The Computer Science department has taught their parallel programming class on it a few times, with mixed results (the most recent time students managed to reliably crash the scheduler, but Greg Thain quickly got a fix in and it was fine afterwards).

There is a research group in Chemistry that runs VASP on it with hybrid OpenMPI+OpenMP, which I thought was pretty neat. It was fun working out how to set the OMP_* environment variables correctly so that the OpenMP threads don't oversubscribe the cores that OpenMPI is also using.
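The arithmetic behind avoiding oversubscription can be sketched in a few lines of Python. This is an illustration only, not the actual configuration used for those VASP jobs: the function name and the choice of OMP_PLACES and OMP_PROC_BIND values are assumptions, and it assumes every MPI rank on a node gets an equal share of that node's cores.

```python
from typing import Dict


def hybrid_omp_env(cores_per_node: int, ranks_per_node: int) -> Dict[str, str]:
    """Pick OpenMP settings so ranks * threads never exceeds the node's cores."""
    if cores_per_node < 1 or ranks_per_node < 1:
        raise ValueError("core and rank counts must be positive")
    if ranks_per_node > cores_per_node:
        raise ValueError("more MPI ranks than cores would oversubscribe the node")
    # Integer division rounds down, so leftover cores stay idle
    # instead of being shared between ranks.
    threads_per_rank = cores_per_node // ranks_per_node
    return {
        "OMP_NUM_THREADS": str(threads_per_rank),
        "OMP_PLACES": "cores",     # assumed: one place per physical core
        "OMP_PROC_BIND": "close",  # assumed: keep a rank's threads near each other
    }


if __name__ == "__main__":
    # Example: 16 cores per node, 4 MPI ranks per node.
    for name, value in hybrid_omp_env(16, 4).items():
        print(f"{name}={value}")
```

The resulting variables would still need to be exported to each rank, for instance through the job's environment or the MPI launcher, and the MPI library's own binding options need to agree with them.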

We've had some Data Science students use multi-node, multi-GPU OpenMPI before, but I don't think it's been used lately.

I hadn't heard the Nvidia news, thanks for sharing. My initial thought reading your message was something about their KAI scheduler, but the acquisition news is surprising. Wonder if they'll try and integrate the two?

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beaumont, Martin <Martin.Beaumont@xxxxxxxxxxxxxxx>
Sent: Monday, December 15, 2025 12:51 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] The dedicated scheduler use in production systems


What nvidia news?

We've been using the parallel universe here for over 5 years. It's not superbly convenient, but we've managed, somehow.
We have to figure out how to tweak the openmpiscript for every new scientific app, and again whenever a new version of the app or of OpenMPI is to be used.
Basically, we have a modified wrapper for each use case, based on the openmpiscript. Some even skip what it should be doing with the scheduler for better job handling.

Martin

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Matthew West via HTCondor-users
Sent: December 15, 2025 1:08 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Matthew West <mwest53@xxxxxxxxxxxxx>
Subject: [HTCondor-users] The dedicated scheduler use in production systems

Good afternoon everybody,

Somewhat spurred by today's news from NVIDIA regarding job schedulers, I am curious whether anyone is using the HTCondor dedicated scheduler / parallel universe in a production environment, and about your experiences in how (well?) it fits your users' workloads.

Cheers,
Matt

--
Matthew T. West (he,him,his) | Systems Programmer/Administrator
UNC Charlotte | University Research Computing (office of OneIT)
9214 South Library Lane | Room 301 | Charlotte, NC 28223
Phone: 704-687-8766
mwest53@xxxxxxxxxxxxx | http://www.charlotte.edu/urc/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
