Hi Masaj,
The CUDAComputeUnits figure is reported based on the card or cards installed in the system. There’s actually no attribute CUDA0ComputeUnits, since that’s
expected to be the same across all cards.
Here’s what is generated in per-card attributes with the “-extra -dynamic” options:
CUDA0DevicePciBusId = "0000:06:00.0"
CUDA0DeviceUuid = "520c5858-f08d-0e24-83b6-47e072996f2b"
CUDA0DieTempC = 32
CUDA0EccErrorsDoubleBit = 0
CUDA0EccErrorsSingleBit = 0
CUDA0FreeGlobalMemory = 8518
CUDA0PowerUsage_mw = 41538
CUDA0UtilizationPct = 77
You can write expressions to incorporate these values, but it won’t have any impact on which card is chosen for the job. The startd simply takes the next
unclaimed device in sequence from the AssignedGPUs list.
One way you can tweak that mechanism is to alter the order of the DetectedGPUs list as the inventory is being taken, perhaps with a wrapper around condor_gpu_discovery.
For example, if condor_gpu_discovery lists all the cards in one cooling region followed by all the cards in another cooling region, you could balance the heating across both regions by changing the order to "CUDA0,CUDA2,CUDA1,CUDA3",
so that GPU assignments alternate between cooling regions.
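The reordering step a wrapper would perform can be sketched in a few lines. This is a minimal sketch, assuming the first half of the detected list sits in one cooling region and the second half in the other; verify how device IDs actually map to regions on your hardware before using anything like it.

```python
# Hypothetical post-processing for condor_gpu_discovery output.
# Assumes DetectedGPUs came back as "CUDA0, CUDA1, CUDA2, CUDA3" and that
# the first half of the list is in one cooling region and the second half
# in the other (an assumption -- check your chassis layout).

def interleave_regions(detected):
    """Reorder a DetectedGPUs value so assignments alternate regions."""
    ids = [g.strip() for g in detected.split(",")]
    half = len(ids) // 2
    region_a, region_b = ids[:half], ids[half:]
    out = []
    for a, b in zip(region_a, region_b):
        out.extend([a, b])
    # Append any leftover card if the count is odd.
    out.extend(region_a[len(region_b):] or region_b[len(region_a):])
    return ", ".join(out)

print(interleave_regions("CUDA0, CUDA1, CUDA2, CUDA3"))
# -> CUDA0, CUDA2, CUDA1, CUDA3
```

A wrapper script would run the real condor_gpu_discovery, rewrite the DetectedGPUs line this way, and print the result, with the startd's GPU inventory configuration pointed at the wrapper instead of the stock tool.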
Michael V Pelletier
Principal Engineer
Raytheon Technologies
Digital Technology
HPC Support Team
Thank you Michael!
The formula below looks like a good idea. I have one additional question: is it okay to use ClassAd references in the form TARGET.CUDAComputeUnits when the real slot ClassAd attribute is CUDA0ComputeUnits or CUDA1ComputeUnits? Is Condor able to automatically translate
to the correct value using AssignedGPUs?
Regards,
Masaj
On 5/20/2021 10:14 PM, Michael Pelletier via HTCondor-users wrote:
For my GPU jobs, I set up a ranking based on the number of compute units, times the number of cores per CU. You might also add the global memory. I do like
the idea of factoring in the CUDA capability level as well, if your cluster has more than one type of card in it.
So for example, in a submit description:
rank = TARGET.CUDAComputeUnits * TARGET.CUDACoresPerCU + CUDAFreeGlobalMemory
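To make the arithmetic concrete, here is a small sketch of how that rank expression would score two hypothetical cards. The attribute values below are made up for illustration; real values come from each slot's ClassAd.

```python
# Hypothetical slot ClassAd values, for illustration only.
slots = {
    "gtx1080": {"CUDAComputeUnits": 20, "CUDACoresPerCU": 128,
                "CUDAFreeGlobalMemory": 8518},
    "v100":    {"CUDAComputeUnits": 80, "CUDACoresPerCU": 64,
                "CUDAFreeGlobalMemory": 16160},
}

def rank(ad):
    # Mirrors: TARGET.CUDAComputeUnits * TARGET.CUDACoresPerCU + CUDAFreeGlobalMemory
    return ad["CUDAComputeUnits"] * ad["CUDACoresPerCU"] + ad["CUDAFreeGlobalMemory"]

best = max(slots, key=lambda name: rank(slots[name]))
print(best)  # -> v100
```

The negotiator prefers the slot whose ad yields the larger rank value, so the job lands on the bigger card when both are unclaimed.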
Michael V Pelletier
Principal Engineer
Raytheon Technologies
Digital Technology
HPC Support Team
On 5/20/2021 8:56 AM, Martin Sajdl wrote:
Hi!
We have a cluster of nodes with GPUs, and we need to set a benchmark number for each slot with a GPU so that we can correctly rank jobs and start each job on the most powerful GPU available.
Does anyone use or know of a GPU benchmark tool? Ideally multi-platform (Linux, Windows)...
Hi Martin,
Just a quick thought:
While it is not strictly a benchmark, a decent proxy might be the CUDACapability attribute that is likely already present in each slot with a GPU (assuming they are NVIDIA GPUs, that is).
You could enter the following condor_status command to see if you feel that CUDACapability makes intuitive sense as a performance metric on your pool:
condor_status -cons 'Gpus > 0' -sort CUDACapability -af Name CUDACapability CUDADeviceName
Hope the above helps
Todd
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/