
Re: [HTCondor-users] multi-gpu-nodes limit access per slot



Hi,

thanks for the helpful thoughts!

We usually have 'only' one GPU per node, so this problem does not normally arise, and I started off with 4 static slots on the only 4-GPU node we have, assuming there would be no demand for a multi-GPU slot. But as always someone spotted the opportunity and now has a project that would profit from a multi-GPU setup, so I changed the 4-GPU machine over to a single dynamic/partitionable slot.
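
For reference, what I mean is one whole-node partitionable slot, roughly along these lines (just a sketch, not our literal config):

# Sketch: a single partitionable slot covering the whole 4-GPU machine,
# with GPU discovery enabled via the standard metaknob
use feature : GPUs
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True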

Now users complain that they can see (use?) all 4 GPUs from a single-GPU slot; I will have to dig further into this to pin the problem down ...

Best
Christoph 

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "nicolas fournials" <nicolas.fournials@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 11 December 2019, 09:41:24
Subject: Re: [HTCondor-users] multi-gpu-nodes limit access per slot

Hi Christoph and Todd,

> On 12/10/2019 10:52 AM, Beyer, Christoph wrote:
>> Hi,
>>
>> I have one 4-GPU node and wonder if there is a way to limit usage on a per-slot basis, e.g. 4 slots that each see & access a single GPU. Are cgroups the way to do this, and if so, how are they configured?
>>
> 
> Maybe on this node just configure HTCondor with four static slots, each
> with one GPU and some amount of CPU/RAM?  If you need partitionable
> slots for some reason (e.g. RAM), you could edit your START expression
> to say only jobs requesting 0 or 1 GPUs will be matched....
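
For the START-expression variant, I guess something like this would do (an untested sketch, to be merged with whatever START policy you already have):

# Untested sketch: on the partitionable multi-GPU node, only match jobs
# that request at most one GPU, on top of the existing START expression.
START = ($(START)) && (TARGET.RequestGPUs =?= undefined || TARGET.RequestGPUs <= 1)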

We did some tests here on 2-socket/2-GPU nodes.
We used the static-slot solution to get correct CPU/GPU affinity
(to limit undue latency). I suppose you could limit other resources the
same way? For example:

# Create specific slots to enforce CPU/GPU affinity
# This conf DOES NOT suit multi-node MPI jobs
SLOT_TYPE_1 = cpus=2, gpus=1
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 2
# GPUs will always be assigned to the partitionable slots in order
ENFORCE_CPU_AFFINITY = True
# Pin each slot to the cores of its socket (extend the lists to the full socket)
SLOT1_CPU_AFFINITY = 0,2
SLOT2_CPU_AFFINITY = 1,3


> As for restricting access to the GPUs, HTCondor will set the
> CUDA_VISIBLE_DEVICES environment variable (and the OpenCL equivalent) to
> point to the GPU provisioned to that slot. This environment variable is
> honored by the low-level CUDA libraries.  Are you worried about GPU codes
> that purposefully ignore or clear this environment variable?

We were worried about this here. One solution we imagined would be to
make /dev/nvidia[0-X] writable only by their owner (root), and to use a
wrapper that changes the ownership when a job begins (via a dedicated
script launched with sudo).
This is a nice solution from the user's point of view, because you only
see the GPUs you have been allocated, as if you were alone on the node,
regardless of the environment.
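
One possible way to wire that up in HTCondor would be something like the following (only USER_JOB_WRAPPER itself is a standard knob; the wrapper path and the sudo rule are our assumptions, untested):

# Untested sketch: run every job through a site wrapper (assumed path)
# which, via a dedicated sudo rule, chowns the /dev/nvidia[0-X] devices
# listed in CUDA_VISIBLE_DEVICES to the job owner, then execs the real job.
USER_JOB_WRAPPER = /usr/local/libexec/gpu-device-chown-wrapper.sh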

However, it seems to me that using cgroups to manage access to the
/dev/nvidia[0-X] devices would be really neat.



-- 
Regards,

Nicolas Fournials