[HTCondor-users] Standardizing Condor GPU interface
- Date: Tue, 24 Sep 2013 12:16:01 -0500
- From: Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Standardizing Condor GPU interface
Hello,
Here at IceCube we are about to start using Condor to run jobs on both
nVidia and AMD GPUs. We'd like our GPU jobs to be compatible with other
sites, so we followed
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus,
which seems to be the closest thing to a standard for defining a Condor
interface for GPU jobs. I thought I'd share a few ideas to improve that
document to better accommodate mixed GPU environments based on our
experiences at IceCube.
GPU_API should be a list, since both CUDA and OpenCL can run on nVidia
GPUs, but only OpenCL can be used with AMD cards.
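For example, if GPU_API stays a single string attribute, a
comma-separated value could express the list (a sketch; the format and
the STARTD_ATTRS export are our assumptions, not anything the wiki
prescribes):
# condor_config fragment on an nVidia machine: advertise both APIs
GPU_API = "CUDA,OpenCL"
STARTD_ATTRS = $(STARTD_ATTRS) GPU_API
Jobs could then match with stringListMember() instead of string
equality, e.g. Requirements = stringListMember("OpenCL", TARGET.GPU_API).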
Currently, the wiki does not mention a classad attribute that identifies
the GPU's manufacturer, e.g. for users who want to run only on AMD
cards. We decided to encode it in GPU_NAME. One problem with GPU_NAME,
however, is that it's not obvious how its content should be formatted in
order to be compatible across sites. We thought about keeping it
consistent with lspci, but its output can be cryptic (e.g. a GTX 690 is
listed as GK104), and the output of nvidia-smi and clinfo doesn't quite
work either. Right now we just set it manually in puppet to things like
"nVidia GeForce GTX 690" and "AMD Radeon HD 7970".
It may be useful to mention in the "Identify the GPU" section that both
CUDA and OpenCL use environment variables to control which GPUs an
application may run on. We use something like the following in our
USER_JOB_WRAPPER script to set them automatically (this way things don't
break if the user forgets to set the environment appropriately in the
submit file):
#!/bin/bash
# Pull the GPU device ordinal assigned to this slot out of the machine ad
gpu_dev=$(awk -F ' = ' '/^GPU_DEV = /{print $2}' "$_CONDOR_MACHINE_AD")
# CUDA (nVidia), COMPUTE (AMD CAL/Stream), GPU_DEVICE_ORDINAL (AMD OpenCL)
export CUDA_VISIBLE_DEVICES=$gpu_dev
export COMPUTE=:0.$gpu_dev
export GPU_DEVICE_ORDINAL=$gpu_dev
exec "$@"
One problem we encountered is that users who run primarily GPU jobs tend
to have much better priority than users who run primarily CPU jobs
(because there are far fewer GPUs than CPUs). This results in heavy CPU
users being almost completely locked out of the GPUs. We added the
following to reduce the severity of this problem, which may also be
useful for the wiki:
SlotWeight = ifthenelse(isUndefined(HAS_GPU), Cpus, 100)
This charges a GPU slot as the equivalent of 100 cores in the usage
accounting, so GPU usage now weighs appropriately in fair-share
calculations.
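You can sanity-check the weights the slots end up advertising with
condor_status's -format option (assuming SlotWeight appears in the
machine ads, as it does for us):
condor_status -format "%s " Name -format "%v\n" SlotWeight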
So that's my two cents. I'd be really interested to hear what other
people are doing.
Vlad