
[HTCondor-users] Wrapper/Knobs to "tune" GPU power limit (nvidia-smi -pl) per Job



Hi,

We have a request from users to be able to "tune" a few machine settings, such as setting a power or frequency cap on the GPU via nvidia-smi.
As far as I understand, these commands need root access. For example, "sudo nvidia-smi -i 0 -pl 300" would reduce the power limit of GPU 0 to 300 W.

Obviously, we cannot simply make this directly available to users: we would have to open it up to everyone, and we would have no control over whether they revert the change once the job is done.

So we were wondering whether there is a way to apply such a setting before the job runs and revert it afterwards, e.g. using USER_JOB_WRAPPER or similar, depending on attributes in the job's ClassAd?
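
To make the idea concrete, here is a minimal, untested sketch in Python of the set / run / revert logic such a wrapper might contain. Several things in it are assumptions rather than anything HTCondor provides out of the box: the custom job attribute GpuPowerLimitWatts (e.g. "+GpuPowerLimitWatts = 300" in the submit file) is a hypothetical name, the wrapper is assumed to be allowed to call nvidia-smi through a passwordless sudo rule, and the GPU identifier and default limit are passed in by the caller (the default could be read via "nvidia-smi --query-gpu=power.default_limit"). It says nothing about whether a wrapper is invoked at all for container jobs.

```python
import re
import subprocess

# Hypothetical custom job attribute; not an HTCondor built-in.
ATTR = "GpuPowerLimitWatts"


def read_power_limit(ad_text, attr=ATTR):
    """Return the requested limit in watts from job ClassAd text, or None."""
    pattern = rf"^\s*{re.escape(attr)}\s*=\s*(\d+(?:\.\d+)?)\s*$"
    match = re.search(pattern, ad_text, re.MULTILINE | re.IGNORECASE)
    return float(match.group(1)) if match else None


def power_limit_cmd(gpu, watts):
    """Build the nvidia-smi call; assumes a passwordless sudo rule exists."""
    return ["sudo", "-n", "nvidia-smi", "-i", str(gpu), "-pl", f"{watts:g}"]


def run_with_power_cap(gpu, ad_text, job_argv, default_watts, run=subprocess.run):
    """Cap the GPU if the job asked for it, run the job, then restore."""
    watts = read_power_limit(ad_text)
    if watts is None:
        # No cap requested: just run the job unchanged.
        return run(job_argv).returncode
    run(power_limit_cmd(gpu, watts), check=True)
    try:
        return run(job_argv).returncode
    finally:
        # Restore the default even if the job fails or raises.
        run(power_limit_cmd(gpu, default_watts), check=True)
```

In a real wrapper, ad_text would come from the file named by the _CONDOR_JOB_AD environment variable and job_argv from the wrapper's own arguments. Note that the revert in the "finally" block does not run if the wrapper itself is killed with SIGKILL, so this alone would not guarantee that the limit gets reset.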

One thing that may matter for the answer: we use containerized jobs (docker & container universe) for everything.
So far, my very brief experiments with combining USER_JOB_WRAPPER and container jobs have not been successful.

We are currently on HTCondor 24.x but are eyeing an upgrade to 25.x, in case that changes anything.

Do you have any experience with, or guidance on, doing this kind of thing in an HTCondor cluster?

Thanks,
- Joachim Meyer

--

Joachim Meyer
HPC-Koordination & Support

Universität des Saarlandes FR Informatik | HPC

Postal address: Postfach 15 11 50 | 66041 Saarbrücken

Visitor address: Campus E1 3 | Room 4.03 66123 Saarbrücken

T: +49 681 302-57522 jmeyer@xxxxxxxxxxxxxxxxxx www.uni-saarland.de