It doesn't appear to be working with either setting:

    OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0
    OFFLINE_MACHINE_RESOURCE_GPUS = "CUDA0"

I set this up in the local config of the execute node which has the GPUs, did a condor_reconfig on that node (no args), and when I submit a GPU-requesting job to it, I get the following in the dynamic slot:

    CUDA_VISIBLE_DEVICES=0
    _CONDOR_AssignedGPUs=CUDA0

When I restart, it works:

    CUDA_VISIBLE_DEVICES=1
    _CONDOR_AssignedGPUs=CUDA1

But a restart doesn't fly for a dynamic-availability feature. When it takes CUDA0 offline at startup, the AssignedGPUs in the partitionable slot changes to omit that string.

Did I maybe not wait long enough for the collector ad to update after the reconfig, or some such?

-Michael Pelletier

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

I think a restart is only required if you are using static slots and want the GPUs to be un-assigned from a static slot. I don't believe a restart is actually required when using partitionable slots; the offline GPUs will just not be assigned to any NEW dynamic slot.

The intent of this knob is that you would set it via condor_config_val -set and then reconfig.

-tj
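For the archives, a sketch of the workflow tj describes, assuming a partitionable-slot execute node: push the knob into the startd's runtime configuration with condor_config_val -set and then reconfig, rather than editing the local config file and restarting. The host name and GPU ID here are placeholders, and -set requires that runtime configuration changes be permitted for your admin identity (e.g. via SETTABLE_ATTRS_ADMINISTRATOR); adjust for your pool.

```shell
# Take CUDA0 offline on the execute node's startd (hypothetical host name).
# New dynamic slots carved from the partitionable slot should then skip CUDA0.
condor_config_val -name gpu-node.example.com -startd \
    -set 'OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0'
condor_reconfig gpu-node.example.com

# Later, bring the GPU back by clearing the list and reconfiguring again.
condor_config_val -name gpu-node.example.com -startd \
    -set 'OFFLINE_MACHINE_RESOURCE_GPUS ='
condor_reconfig gpu-node.example.com
```

These are administrative commands against a live pool, so they are shown as a fragment only; verify the exact option spelling against your HTCondor version's condor_config_val man page.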