Hi,
thanks for confirming the situation and that the USER_JOB_WRAPPER
(partially) does what we want!
For reference:
I've now implemented this using a setuid binary that starts zeusd.
zeusd then provides a Unix socket to the job, over which the job
can send HTTP API requests to change the power and clock
preferences.
As this is started in the USER_JOB_WRAPPER, zeusd already only sees the relevant GPUs, which is great.
There remain two issues:
1. This does not actually work with Docker jobs, as the
docker_proc (or whatever it's called) in HTCondor does not run the
USER_JOB_WRAPPER (which would be a challenge, I know), so we only
support this with Apptainer jobs.
2. The docs clearly state that we should "exec $@" in the
USER_JOB_WRAPPER. However, we really cannot do that, as we need
to be able to clean up after a job, i.e. reset the GPU parameters.
So we actually register a bash trap and run the user command
(/usr/bin/singularity ...) without "exec". I'm not entirely sure
whether this breaks anything; in our testing everything seemed to
work as expected. Note that this wouldn't be necessary if we
were able to run all jobs with the USER_JOB_WRAPPER, as we could
then reset the GPU before each job starts. But as mentioned in 1,
this doesn't work with Docker.
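For anyone finding this later, the trap approach from point 2 looks roughly like the sketch below. This is a simplified illustration, not our actual wrapper: reset_gpu_settings and run_job are hypothetical names, and the body of reset_gpu_settings is a placeholder for whatever site-specific call restores the GPU defaults. It also does not deal with signal forwarding to the job.

```shell
#!/bin/bash
# Sketch of a USER_JOB_WRAPPER that runs the job WITHOUT exec so that
# a cleanup step can run after the job has finished.

# ASSUMPTION: hypothetical placeholder. Replace the body with the real,
# site-specific command that resets the GPU power and clock settings.
reset_gpu_settings() {
    echo "resetting GPU settings" >&2
}

run_job() {
    (
        # The EXIT trap fires when this subshell ends, whether the job
        # succeeded or failed.
        trap reset_gpu_settings EXIT
        # No exec here: the shell has to outlive the job to run the trap.
        "$@"
        # Pass the job's exit status on to the caller.
        exit $?
    )
}

# When used as the wrapper, the job command line arrives as arguments.
if [ "$#" -gt 0 ]; then
    run_job "$@"
    exit $?
fi
```

The subshell keeps the trap local to a single job run, and the explicit "exit $?" makes sure the job's exit status is what the wrapper reports.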
Best,
- Joachim
On 2/13/26 08:49, Joachim Meyer wrote:
Hi,
we have a request from users to be able to "tune" a few machine variables, such as setting a power or frequency cap on the GPU via nvidia-smi.
As far as I understand, these commands do need root access. E.g. "sudo nvidia-smi -i 0 -pl 300W" would reduce the power limit to 300W.
Hi Joachim:
We do not have a good way to do this. The USER_JOB_WRAPPER might be able to do this, but it does not have root privilege. In theory, one could write a setuid script that lowers the wattage of the GPU and call it from the USER_JOB_WRAPPER, but I will leave it to you whether that is a risk you'd be willing to take.
-greg