It might be -compact, which is like adding -constraint "PartitionableSlot =?= true || DynamicSlot =!= true"

But -compact only shows one line per machine, even if it gets back multiple ads for that machine.
This can lead to weird results when you mix -compact with -af but have static slots or multiple p-slots.

-tj

From: Russell Smithies <Russell.Smithies@xxxxxxxxxx>

That's exactly what we'd like
😊

I did a few installs and uninstalls and miraculously the servers connected - I still have no idea why, but it's working now!
I'm only seeing one GPU per node (the first device?), which is odd as all the servers have two GPUs. Could it be the way I have my constraints?

muthur# /usr/libexec/condor/condor_gpu_discovery -extra
DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine        GPUs_DeviceName        GPUs_Capability  GPUs_DriverVersion  GPUs_GlobalMemoryMb  GPUsMemoryUsage  GPUs_DeviceUuid                       DeviceGPUsAverageUsage
kscprod-data1  Tesla V100-PCIE-32GB   7.0              12.1                32501                271.0            5e382249-1938-0c64-2b04-04631b812baa  0.0
kscprod-data2  NVIDIA A100-PCIE-40GB  8.0              12.1                40377                29496.0          387fd653-c749-2ec6-8eab-f967090d6579  0.6562980190294957
kscprod-data3  NVIDIA A100 80GB PCIe  8.0              12.2                81051                878.0            5f846c33-4dd5-ad62-eb12-c3813915d819  0.0001105264908529342

thanx,
--Russell

-----Original Message-----

We are currently working on condor_status -gpus and hope to have something in the next version of HTCondor. Something like this is likely

Name            User                 GPUs  GPU-Memory  GPU-Name
slot1@machine1  user1@xxxxxxxxxxxxx  1     10.6 GB     NVIDIA GeForce RTX 2080 Ti
slot1@machine2  user2@xxxxxxxxxxxxx  1     15.9 GB     Tesla P100-PCIE-16GB
...

I would be interested in your thoughts about what sort of information you would like to see.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies
Sent: Wednesday, July 26, 2023 9:57 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

I figured it out eventually - I had the "use feature" bit in the config, but the tags start with "GPUs_" not "CUDA", e.g. "GPUs_DeviceName" not "CUDADeviceName".

muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine                   GPUs_DeviceName        GPUs_Capability  GPUs_DriverVersion  GPUs_GlobalMemoryMb  GPUs_DeviceUuid
kscprod-data3.esr.cri.nz  NVIDIA A100 80GB PCIe  8.0              12.2                81051                5f846c33-4dd5-ad62-eb12-c3813915d819

My next issue is sorting out munge authentication, if anyone can point me to some useful docs? I can't get it to use anything but the default tokens ;-(
We've used munge on slurm so I don't see any great need to change.

--Russell

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Thursday, July 27, 2023 1:08 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

Add

use FEATURE : GPUs

to the configuration of your STARTD to have it run condor_gpu_discovery on startup and treat the GPUs as slot resources.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies
Sent: Wednesday, July 26, 2023 3:35 PM
Subject: [HTCondor-users] Condor to manage GPUs only

Hi all,

I used Condor 20 years ago and am trying to transition back from slurm.
I want to initially only use Condor for managing the GPUs on 3 servers: two servers have 2 x A100s and one server has 2 x V100. I'm not sure of the best way to do this - or if it's even possible? Surely given the number of products that are "powered by GPUs" it must be.

When I do a "condor_gpu_discovery" I can see the GPUs:

muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

But when I do "condor_status" I don't see the GPUs, only the CPU resources. And on this server with a pair of AMD EPYC 75F3 processors that's 128 slots to scroll through.
What I really want to see is no CPU slots, only the GPUs. Is this possible or am I asking too much?
Is there a better way of job scheduling for GPUs?

Thanx,
Russell Smithies

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
The archives can be found at: