
Re: [HTCondor-users] Wrapper/Knobs to "tune" GPU power limit (nvidia-smi -pl) per Job



Hi,
thanks for confirming the situation and that the USER_JOB_WRAPPER (partially) does what we want!

For reference:
I've now implemented this, using a setuid binary that starts zeusd.
zeusd then provides a Unix socket to the job, over which the job can send HTTP API requests to change the power and clock preferences.
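
For readers who want to try this from inside a job, here is a rough sketch of how an HTTP request can be sent over a Unix socket with curl. The socket path, the endpoint and the JSON body are placeholders I made up for illustration, not the actual zeusd API, so please check the zeusd documentation for the real routes. Setting DRY_RUN only prints the curl command instead of running it.

```shell
#!/bin/bash
# Assumption: the socket path below is hypothetical; use whatever path the
# wrapper actually exposes to the job.
ZEUSD_SOCKET="${ZEUSD_SOCKET:-/tmp/zeusd.sock}"

# Send a JSON POST request through the Unix socket.
#   $1: endpoint path (placeholder, see the zeusd docs for real routes)
#   $2: JSON request body (placeholder as well)
zeusd_post() {
    local endpoint=$1 body=$2
    # --unix-socket makes curl connect via the socket; the host part of the
    # URL is then only used for the Host header. --data implies POST.
    # With DRY_RUN set, the command is echoed instead of executed.
    ${DRY_RUN:+echo} curl --silent --show-error \
        --unix-socket "$ZEUSD_SOCKET" \
        --header 'Content-Type: application/json' \
        --data "$body" \
        "http://localhost${endpoint}"
}
```

A call would then look like `zeusd_post /some/endpoint '{"key": "value"}'`, with both arguments replaced by whatever the daemon really expects.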

As this is started in the USER_JOB_WRAPPER, zeusd already only sees the relevant GPUs, which is great.

There remain two issues:
1. This does not actually work with Docker jobs, as the docker_proc (or whatever it's called) in HTCondor does not run the USER_JOB_WRAPPER (which would be a challenge, I know), so we only support this with Apptainer jobs.
2. The docs clearly state that we should "exec $@" in the USER_JOB_WRAPPER. However, we really cannot do that, as we need to be able to clean up after a job, i.e. reset the GPU parameters. So we register a bash trap and run the user command (/usr/bin/singularity ...) without "exec". I'm not entirely sure whether this breaks anything, but in our testing everything seemed to work as expected. Note that this wouldn't be necessary if we were able to run all jobs with the USER_JOB_WRAPPER, as we could then reset the GPU before each job starts. But as mentioned in 1, this doesn't work with Docker.
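
To illustrate the pattern from point 2, here is a minimal sketch of a wrapper that runs the job as a child process and cleans up afterwards. It is not our actual wrapper: reset_gpu is a placeholder that only prints a message, and the function names are made up for this example.

```shell
#!/bin/bash
# Sketch of a USER_JOB_WRAPPER that cleans up after the job instead of
# exec'ing it.

# Placeholder (assumption): replace with the site's real reset mechanism.
reset_gpu() {
    echo "resetting GPU settings (placeholder)" >&2
}

# The parenthesised body runs in a subshell, so the EXIT trap fires once the
# job command has finished, without affecting the calling shell.
run_with_cleanup() (
    trap reset_gpu EXIT
    "$@"          # run the job as a child process: no exec
    exit $?       # propagate the job's exit status
)

# HTCondor passes the job's command line as the wrapper's arguments.
if [ "$#" -gt 0 ]; then
    run_with_cleanup "$@"
    exit $?
fi
```

The price of dropping "exec" is that the wrapper shell stays around as the parent of the job. One thing to keep in mind is that bash only runs a signal trap after the foreground command has returned, so cleanup after a signal depends on the job itself terminating.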

Best,
- Joachim


On 24.02.26 at 19:16, Greg Thain via HTCondor-users wrote:
On 2/13/26 08:49, Joachim Meyer wrote:

Hi,

we have a request from users to be able to "tune" a few machine variables, such as setting a power or frequency cap on the GPU via nvidia-smi.
As far as I understand, these commands do need root access. E.g. "sudo nvidia-smi -i 0 -pl 300W" would reduce the power limit to 300W.


Hi Joachim:

We do not have a good way to do this. The USER_JOB_WRAPPER might be able to do this, but it does not have root privilege. In theory, one could write a setuid script that lowers the wattage of the GPU and that you could put into the USER_JOB_WRAPPER, but I will leave it to you whether that is a risk you'd be willing to take.

-greg

