Hi,
We have a request from users to be able to "tune" a few machine
settings, such as setting a power or frequency cap on the GPU via
nvidia-smi.
As far as I understand, these commands require root access. E.g.
"sudo nvidia-smi -i 0 -pl 300W" would reduce the power limit to
300W.
Obviously, we have concerns about making this directly available to users: we would have to open it up to everyone, and we would have no control over whether they revert the change once the job is done.
So we were wondering whether there is a way to apply such a setting before the job runs and revert it afterwards, e.g. using USER_JOB_WRAPPER or similar, depending on the job's ClassAd attributes?
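To make the idea concrete, here is a rough Python sketch of the "set, run, revert" behaviour we have in mind. It is only an illustration under assumptions: the function names are made up, it assumes something with root privileges runs it around the job, and it does not show how HTCondor would invoke it or how the requested limit would be read from the job's ClassAd. The only real commands used are nvidia-smi's -pl option and its power.default_limit query.

```python
import subprocess
from contextlib import contextmanager


def power_limit_cmd(gpu_index, watts):
    """Build the nvidia-smi command that sets a power limit in watts."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def default_limit_cmd(gpu_index):
    """Build the nvidia-smi query for the GPU's default power limit."""
    return [
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=power.default_limit",
        "--format=csv,noheader,nounits",
    ]


def run_cmd(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


@contextmanager
def temporary_power_limit(gpu_index, watts, run=run_cmd):
    """Apply a power limit for the duration of the block, then restore the default.

    `run` is injectable so the logic can be exercised without a GPU.
    """
    default = float(run(default_limit_cmd(gpu_index)).strip())
    run(power_limit_cmd(gpu_index, watts))
    try:
        yield default
    finally:
        # Always revert, even if the job (the body of the block) fails.
        run(power_limit_cmd(gpu_index, default))
```

The job itself would run inside the `with temporary_power_limit(0, 300):` block; the open question is which HTCondor mechanism could host such logic with the necessary privileges.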
One thing that may matter for the answer: we use
containerized jobs (docker & container universe) for
everything.
So far, my very brief experiments combining USER_JOB_WRAPPER
with container jobs have not been successful.
We are currently on HTCondor 24.x, but are eyeing an upgrade to 25.x - if that changes anything.
Do you have any experience / guidance for doing these things in an HTCondor cluster?
Thanks,
- Joachim Meyer
Joachim Meyer
HPC Coordination & Support
Universität des Saarlandes FR Informatik | HPC
Postal address: Postfach 15 11 50 | 66041 Saarbrücken
Visitor address: Campus E1 3 | Room 4.03 66123 Saarbrücken
T: +49 681 302-57522 jmeyer@xxxxxxxxxxxxxxxxxx www.uni-saarland.de