Re: [HTCondor-users] Condor to manage GPUs only
- Date: Fri, 28 Jul 2023 15:13:36 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Condor to manage GPUs only
Thanks for the suggestion. I'm not sure about the GPU ordinal; I don't think we have that information for GPUs that have a UUID, which should be all NVIDIA GPUs at this point.
-tj
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Valerio Bellizzomi
Sent: Friday, July 28, 2023 2:01 AM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Condor to manage GPUs only
On Thu, 2023-07-27 at 22:11 +0000, John M Knoeller via HTCondor-users
wrote:
> We are currently working on
>
> condor_status -gpus
>
> and hope to have something in the next version of
> HTCondor. Something like this is likely
>
> Name            User                 GPUs  GPU-Memory  GPU-Name
>
> slot1@machine1  user1@xxxxxxxxxxxxx  1     10.6 GB     NVIDIA GeForce RTX 2080 Ti
> slot1@machine2  user2@xxxxxxxxxxxxx  1     15.9 GB     Tesla P100-PCIE-16GB
> ...
>
> I would be interested in your thoughts about what sort of information
> you would like to see.
>
> -tj
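
[Editor's sketch] The column layout proposed above can be mocked up in a few lines of Python. This is only an illustration of the table shape, not how `condor_status -gpus` is or will be implemented; the function name `format_gpu_table` is hypothetical and the rows are copied from the example in the message.

```python
# Hypothetical helper: renders rows in the proposed five-column layout.
HEADERS = ("Name", "User", "GPUs", "GPU-Memory", "GPU-Name")

# Example rows taken from the message above (addresses are obscured in the archive).
ROWS = [
    ("slot1@machine1", "user1@xxxxxxxxxxxxx", 1, "10.6 GB", "NVIDIA GeForce RTX 2080 Ti"),
    ("slot1@machine2", "user2@xxxxxxxxxxxxx", 1, "15.9 GB", "Tesla P100-PCIE-16GB"),
]


def format_gpu_table(rows, headers=HEADERS):
    """Return the rows as left-aligned, space-padded columns under the headers."""
    table = [tuple(headers)] + [tuple(str(cell) for cell in row) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = []
    for row in table:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_gpu_table(ROWS))
```

Adding the fields suggested in the reply (ordinal, global memory, UUID) would only mean extending `HEADERS` and the row tuples.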
Maybe add the GPU ordinal, GPU global memory, and the GPU UUID?
Cheers
Valerio
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> Of Russell Smithies
> Sent: Wednesday, July 26, 2023 9:57 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Condor to manage GPUs only
>
> I figured it out eventually - I had the "use feature" bit in the
> config, but the tags start with "GPUs_" not "CUDA", e.g.
> "GPUs_DeviceName" not "CUDADeviceName".
>
> muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' \
>     -compact -af:h machine GPUs_DeviceName GPUs_Capability \
>     GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage \
>     GPUs_DeviceUuid DeviceGPUsAverageUsage
> machine                   GPUs_DeviceName        GPUs_Capability  GPUs_DriverVersion  GPUs_GlobalMemoryMb  GPUs_DeviceUuid
> kscprod-data3.esr.cri.nz  NVIDIA A100 80GB PCIe  8.0              12.2                81051                5f846c33-4dd5-ad62-eb12-c3813915d819
>
> My next issue is sorting out munge authentication if anyone can point
> me to some useful docs? I can't get it to use anything but the
> default tokens ;-(
> We've used munge on slurm so I don't see any great need to change.
>
> --Russell
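
[Editor's sketch] For anyone scripting the query above, the same `condor_status` invocation can be assembled from Python. The flags, the constraint, and the attribute names are exactly those used in the message; the function name `gpu_status_command` and the default attribute tuple are hypothetical conveniences. The sketch only builds the argument list and does not run anything.

```python
import shlex

# Attribute names as used in the message above (the "GPUs_" prefix, not "CUDA").
GPU_ATTRS = (
    "machine",
    "GPUs_DeviceName",
    "GPUs_Capability",
    "GPUs_DriverVersion",
    "GPUs_GlobalMemoryMb",
    "GPUs_DeviceUuid",
)


def gpu_status_command(attrs=GPU_ATTRS):
    """Build the argv list for a condor_status query limited to machines with GPUs."""
    return [
        "condor_status",
        "-constraint", "!isUndefined(DetectedGPUs)",  # hide machines without detected GPUs
        "-compact",
        "-af:h",  # autoformat with a header row
        *attrs,
    ]


if __name__ == "__main__":
    # Print a shell-safe rendering of the command for copy and paste.
    print(shlex.join(gpu_status_command()))
```

On a host where HTCondor is installed, the list could be passed to `subprocess.run(..., capture_output=True, text=True)`. Passing arguments as a list avoids shell quoting problems with the `!` in the constraint.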
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> Of John M Knoeller via HTCondor-users
> Sent: Thursday, July 27, 2023 1:08 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Condor to manage GPUs only
>
> Add
>
> use FEATURE : GPUs
>
> to the configuration of your STARTD to have it run
> condor_gpu_detection on startup and treat the GPUs as slot resources.
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> Of Russell Smithies
> Sent: Wednesday, July 26, 2023 3:35 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] Condor to manage GPUs only
>
>
> Hi all,
> I used Condor 20 years ago and am trying to transition back from
> slurm.
>
> I want to initially only use Condor for managing the GPUs on 3
> servers, two servers have 2 x A100s and one server has 2 X V100.
> I'm not sure of the best way to do this - or if it's even possible?
> Surely given the number of products that are "powered by GPUs" it
> must be.
>
> When I do a "condor_gpu_discovery" I can see the GPUs:
> muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
> DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
> Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108;
> CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe";
> DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051;
> MaxSupportedVersion=12020; ]
> GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0";
> DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
> GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0";
> DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
>
> But when I do "condor_status" I don't see the GPUs but only see the
> CPU resources. And on this server with a pair of AMD EPYC 75F3
> processors that's 128 slots to scroll through.
> What I really want to see is no CPU slots, only the GPUs.
> Is this possible or am I asking too much?
> Is there a better way of job scheduling for GPUs?
>
> Thanx,
>
> Russell Smithies
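
[Editor's sketch] The `-extra -nested` discovery output quoted above is a `DetectedGPUs` list plus one `name=[ key=value; ... ]` record per GPU and a shared `Common` record. A minimal Python parser for that shape is sketched below. It assumes each record sits on a single line (the email's wrapping obscures this) and that values contain no embedded semicolons or brackets; the function name `parse_nested_discovery` is hypothetical, and all values are kept as strings.

```python
import re

# Sample copied from the discovery output in the message above, one record per line.
SAMPLE = '''DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
'''

RECORD = re.compile(r'^(\w+)=\[(.*)\]\s*$')          # name=[ ... ]
FIELD = re.compile(r'(\w+)=("[^"]*"|[^;]+);')        # key=value; (value may be quoted)


def parse_nested_discovery(text):
    """Return {gpu_id: properties}, with the Common record merged into each GPU."""
    records = {}
    detected = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("DetectedGPUs="):
            value = line.split("=", 1)[1].strip().strip('"')
            detected = [gpu.strip() for gpu in value.split(",") if gpu.strip()]
            continue
        match = RECORD.match(line)
        if match:
            fields = FIELD.findall(match.group(2))
            records[match.group(1)] = {k: v.strip().strip('"') for k, v in fields}
    common = records.get("Common", {})
    gpus = {}
    for gpu_id in detected:
        # Record names use '_' where the GPU id uses '-', e.g. GPU_5f846c33.
        own = records.get(gpu_id.replace("-", "_"), {})
        gpus[gpu_id] = {**common, **own}
    return gpus


if __name__ == "__main__":
    for gpu_id, props in parse_nested_discovery(SAMPLE).items():
        print(gpu_id, props["DeviceName"], props["GlobalMemoryMb"], props["DeviceUuid"])
```

Merging `Common` into each GPU gives one flat record per device, which is the information a per-GPU listing would need.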
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/