That's exactly what we'd like :-)
I did a few installs and uninstalls and, miraculously, the servers connected - I still have no idea why, but it's working now!
I'm only seeing one GPU per node (the first device?), which is odd as all the servers have two GPUs. Could it be the way I have my constraints? (There's a check I still want to try after the output below.)
muthur# /usr/libexec/condor/condor_gpu_discovery -extra
DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
kscprod-data1 Tesla V100-PCIE-32GB 7.0 12.1 32501 271.0 5e382249-1938-0c64-2b04-04631b812baa 0.0
kscprod-data2 NVIDIA A100-PCIE-40GB 8.0 12.1 40377 29496.0 387fd653-c749-2ec6-8eab-f967090d6579 0.6562980190294957
kscprod-data3 NVIDIA A100 80GB PCIe 8.0 12.2 81051 878.0 5f846c33-4dd5-ad62-eb12-c3813915d819 0.0001105264908529342
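One check I still want to run is the machine-level GPU totals rather than the per-device properties - a sketch, assuming the standard attribute names advertised by "use FEATURE : GPUs" (ClassAd attribute names are case-insensitive):
muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h Machine TotalGPUs AssignedGPUs
If TotalGPUs reports 2 per machine, both devices are being advertised and the problem is just how the projection above displays them, not my constraint.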
thanx,
--Russell
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Friday, July 28, 2023 10:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only
We are currently working on
condor_status -gpus
and hope to have something in the next version of HTCondor. Something like this is likely:
Name User GPUs GPU-Memory GPU-Name
...
I would be interested in your thoughts about what sort of information you would like to see.
-tj
-----Original Message-----
Sent: Wednesday, July 26, 2023 9:57 PM
Subject: Re: [HTCondor-users] Condor to manage GPUs only
I figured it out eventually - I had the "use feature" bit in the config, but the tags start with "GPUs_", not "CUDA", e.g. "GPUs_DeviceName" not "CUDADeviceName".
muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DeviceUuid
kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0 12.2 81051 5f846c33-4dd5-ad62-eb12-c3813915d819
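A related find, though I haven't verified it on our version yet: newer HTCondor releases apparently let a submit file match on those same GPU properties directly (without the GPUs_ prefix), e.g.:
request_GPUs = 1
require_gpus = Capability >= 8.0
Worth checking the submit manual for whichever release you're on, since require_gpus is a relatively recent addition.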
My next issue is sorting out munge authentication - can anyone point me to some useful docs? I can't get it to use anything but the default tokens ;-( We've used munge on Slurm so I don't see any great need to change.
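The knobs I've been experimenting with, for what it's worth - I can't vouch for this yet, and it assumes the packages were built with MUNGE support and munged is running on every host:
SEC_DEFAULT_AUTHENTICATION_METHODS = MUNGE
SEC_CLIENT_AUTHENTICATION_METHODS = MUNGE
set on all machines, but so far it still falls back to the default tokens, hence the request for docs.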
--Russell
-----Original Message-----
Sent: Thursday, July 27, 2023 1:08 PM
Subject: Re: [HTCondor-users] Condor to manage GPUs only
Add
use FEATURE : GPUs
to the configuration of your STARTD to have it run condor_gpu_discovery on startup and treat the GPUs as slot resources.
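For example, dropped into a config.d fragment (the file name here is just an example), followed by a startd restart - a plain reconfig may not be enough, since slot resources are computed when the startd starts:
echo 'use FEATURE : GPUs' > /etc/condor/config.d/10-gpus.conf
condor_restart -daemon startd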
-tj
-----Original Message-----
Sent: Wednesday, July 26, 2023 3:35 PM
Subject: [HTCondor-users] Condor to manage GPUs only
Hi all,
I used Condor 20 years ago and am trying to transition back from Slurm.
I want to initially use Condor only for managing the GPUs on 3 servers: two servers have 2 x A100s and one server has 2 x V100s.
I'm not sure of the best way to do this - or if it's even possible? Surely, given the number of products that are "powered by GPUs", it must be.
When I do a "condor_gpu_discovery" I can see the GPUs:
muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
But when I do "condor_status" I don't see the GPUs, only the CPU resources. And on this server, with a pair of AMD EPYC 75F3 processors, that's 128 slots to scroll through.
What I really want to see is no CPU slots, only the GPUs.
Is this possible, or am I asking too much?
Is there a better way of job scheduling for GPUs?
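From my reading of the docs so far - treat this as an unverified sketch - one way to get both halves of that (GPUs advertised, and not 128 slots to scroll through) might be a single partitionable slot per machine:
use FEATURE : GPUs
use FEATURE : PartitionableSlot
and then something like condor_status -constraint 'TotalGPUs > 0' to list only the GPU machines.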
Thanx,
Russell Smithies