On 01/07/2010 12:09 PM, Ian D. Alderman wrote:
>
> On Jan 7, 2010, at 9:38 AM, Miron Livny wrote:
>
>> To all GPUers out there,
>>
>> We would be very interested in hearing from you what Condor can do to
>> help you in managing GPU clusters. So far we have not found much we can
>> offer in this space. Any guidance you can provide will be most welcome.
>>
>> Miron
>
>
> Hi,
>
> We've done work helping customers to set up policies enabling GPU
> scheduling. Our approach has been to set attributes in GPU-specific jobs
> and slot-types, and require that the attribute be set to match with
> GPU-specific slots. Condor handles the scheduling gracefully given this
> setup.
>
> A majority of the work relates to policies. It would be great to get
> information about the presence of the GPU, its model, and utilization,
> but we're not aware of any standard way to do this across GPU
> vendors/models. GPU-model-specific scripts can be created to advertise
> this information in the slot ads using Hawkeye/STARTD_CRON for a
> dedicated cluster. Condor could help by offering concurrency limits for
> an individual host (e.g. this machine has a GPU_Limit=2 because it has
> only 2 GPUs), or making dynamic slots more configurable.
>
> Because of the difficulties with automatic detection and telemetry, using
> pre-created policies seems to work well.
>
> Cheers,
>
> -Ian
>
I've put a lot of thought into how host-specific concurrency limits, together with dynamic slots, could work to manage resources such as GPUs. I was hoping to mock up an implementation over the holidays but ended up just relaxing instead. If you're interested in such functionality, let me know and I'll share my thoughts with you.
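
To make the Hawkeye/STARTD_CRON approach Ian describes a bit more concrete, here is a minimal sketch of a script that advertises GPU information in the slot ad. It assumes the GPU inventory is written down by hand per host (in the spirit of the pre-created policies above) rather than detected automatically. The attribute names HasGPU, GPUCount, and GPUModel, the LOCAL_GPUS list, and the sample model string are all hypothetical choices for illustration, not anything Condor defines.

```python
#!/usr/bin/env python
"""Hypothetical STARTD_CRON helper: print GPU attributes for the slot ad."""

# Assumption: this list is maintained by hand for each host, because there is
# no vendor-neutral way to detect GPUs. Swap in a model-specific probe if you
# have one. The model string here is only sample data.
LOCAL_GPUS = ["Tesla C1060", "Tesla C1060"]


def format_gpu_ad(gpus):
    """Return ClassAd-style 'Attribute = value' lines for the given GPUs."""
    lines = [
        "HasGPU = %s" % ("True" if gpus else "False"),
        "GPUCount = %d" % len(gpus),
    ]
    if gpus:
        # ClassAd strings are double-quoted; only the first model is advertised.
        lines.append('GPUModel = "%s"' % gpus[0])
    return lines


if __name__ == "__main__":
    # The startd picks up attribute/value pairs from the script's stdout.
    print("\n".join(format_gpu_ad(LOCAL_GPUS)))
```

Once a script along these lines is registered as a STARTD_CRON job, a GPU job could match against it with something like `requirements = (HasGPU =?= True)` in its submit file. This only covers advertising; it does not stop two jobs from landing on the same GPU, which is where host-specific limits or dynamic slots would come in.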