Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor to manage GPUs only

Date: Thu, 27 Jul 2023 23:11:26 +0000
From: Russell Smithies <Russell.Smithies@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

That's exactly what we'd like ð

I did a few installs and uninstalls and miraculously the servers connected - I still have no idea why but it's working now!

I'm only seeing one GPU per node (the first device?) which is odd as all the servers have two GPUs, it could be the way I have my constraints?

muthur# /usr/libexec/condor/condor_gpu_discovery -extra

DetectedGPUs="GPU-5f846c33, GPU-c60861f1"

Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]

GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]

GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage

machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage

kscprod-data1 Tesla V100-PCIE-32GB 7.0 12.1 32501 271.0 5e382249-1938-0c64-2b04-04631b812baa 0.0

kscprod-data2 NVIDIA A100-PCIE-40GB 8.0 12.1 40377 29496.0 387fd653-c749-2ec6-8eab-f967090d6579 0.6562980190294957

kscprod-data3 NVIDIA A100 80GB PCIe 8.0 12.2 81051 878.0 5f846c33-4dd5-ad62-eb12-c3813915d819 0.0001105264908529342

thanx,

--Russell

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Friday, July 28, 2023 10:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

We are currently working on

condor_status -gpus

and hope to have something in the next version of HTCondor. Something like this is likely

Name User GPUs GPU-Memory GPU-Name

slot1@machine1 user1@xxxxxxxxxxxxx 1 10.6 GB NVIDIA GeForce RTX 2080 Ti slot1@mahine2 user2@xxxxxxxxxxxxx 1 15.9 GB Tesla P100-PCIE-16GB

...

I would be interested in your thoughts about what sort of information you would like to see.

-tj

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies

Sent: Wednesday, July 26, 2023 9:57 PM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] Condor to manage GPUs only

I figured it out eventually - I had the "use feature" bit in the config, but the tags start with "GPUs_" not CUDA" eg. " GPUs_DeviceName" not " CUDADeviceName"

machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DeviceUuid

kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0 12.2 81051 5f846c33-4dd5-ad62-eb12-c3813915d819

My next issue is sorting out munge authentication if anyone can point me to some useful docs? I can't get it to use anything but the default tokens ;-( We've used munge on slurm so I don't see any great need to change.

--Russell

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users

Sent: Thursday, July 27, 2023 1:08 PM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: John M Knoeller <johnkn@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] Condor to manage GPUs only

Add

use FEATURE : GPUs

to the configuration of your STARTD to have it run condor_gpu_detection on startup and treat the GPUs as slot resources.

-tj

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies

Sent: Wednesday, July 26, 2023 3:35 PM

To: htcondor-users@xxxxxxxxxxx

Subject: [HTCondor-users] Condor to manage GPUs only

Hi all,

I used Condor 20 years ago and am trying to transition back from slurm.

I want to initially only use Condor for managing the GPUs on 3 servers, two servers have 2 x A100s and one server has 2 X V100.

I'm not sure of the best way to do this - or if it's even possible? Surely given the number of products that are "powered by GPUs" it must be.

When I do a "condor_gpu_discovery" I can see the GPUs:

muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested

DetectedGPUs="GPU-5f846c33, GPU-c60861f1"

GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]

GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

But when I do "condor_status" I don't see the GPUs but only see the CPU resources. And on this server with a pair of AMD EPYC 75F3 processors that's 128 slots to scroll through.

What I really want to see is no CPU slots, only the GPUs.

Is this possible or am I asking too much.

Is there a better way of job scheduling for GPUs?

Thanx,

Russell Smithies

_______________________________________________

HTCondor-users mailing list

To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

subject: Unsubscribe

You can also unsubscribe by visiting

https://aus01.safelinks.protection.outlook.com/?url="">

The archives can be found at: