[HTCondor-users] Standardizing Condor GPU interface
- Date: Tue, 24 Sep 2013 12:16:01 -0500
- From: Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Standardizing Condor GPU interface
Hello,
Here at IceCube we are about to start using Condor to run jobs on both
nVidia and AMD GPUs. We'd like our GPU jobs to be compatible with other
sites, so we followed
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus,
which seems to be the closest thing to a standard for defining a Condor
interface for GPU jobs. I thought I'd share a few ideas to improve that
document to better accommodate mixed GPU environments based on our
experiences at IceCube.
GPU_API should be a list, since both CUDA and OpenCL can run on nVidia
GPUs, but only OpenCL can be used with AMD cards.
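For example, if GPU_API stays a single string attribute, a
comma-separated value could express the list (a sketch; the format and
the STARTD_ATTRS export are our assumptions, not anything the wiki
prescribes):
# condor_config fragment on an nVidia machine: advertise both APIs
GPU_API = "CUDA,OpenCL"
STARTD_ATTRS = $(STARTD_ATTRS) GPU_API
Jobs could then match with stringListMember() instead of string
equality, e.g. Requirements = stringListMember("OpenCL", TARGET.GPU_API).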
Currently, the wiki does not mention a classad attribute that identifies
the GPU's manufacturer, e.g. for users who want to run only on AMD
cards. We decided to encode it in GPU_NAME. One problem with GPU_NAME,
however, is that it's not obvious how its content should be formatted in
order to be compatible across sites. We thought about keeping it
consistent with lspci, but its output can be cryptic (e.g. a GTX 690 is
listed as GK104), and the output of nvidia-smi and clinfo doesn't quite
work either. Right now we just set it manually in puppet to things like
"nVidia GeForce GTX 690" and "AMD Radeon HD 7970".
It may be useful to mention in the "Identify the GPU" section that both
CUDA and OpenCL use environment variables to control which GPUs an
application may run on. We use something like the following in our
USER_JOB_WRAPPER script to set them automatically (this way things don't
break if the user forgets to set the environment appropriately in the
submit file):
#!/bin/bash
# Pull the GPU device ordinal assigned to this slot out of the machine ad
gpu_dev=$(awk -F ' = ' '/^GPU_DEV = /{print $2}' "$_CONDOR_MACHINE_AD")
# CUDA (nVidia), COMPUTE (AMD CAL/Stream), GPU_DEVICE_ORDINAL (AMD OpenCL)
export CUDA_VISIBLE_DEVICES=$gpu_dev
export COMPUTE=:0.$gpu_dev
export GPU_DEVICE_ORDINAL=$gpu_dev
exec "$@"
One problem we encountered is that users who run primarily GPU jobs tend
to have much better priority than users who run primarily CPU jobs
(because there are far fewer GPUs than CPUs). This results in heavy CPU
users being almost completely locked out of the GPUs. We added the
following to reduce the severity of this problem, which may also be
useful for the wiki:
SlotWeight = ifthenelse(isUndefined(HAS_GPU), Cpus, 100)
This charges a GPU slot as the equivalent of 100 cores in the usage
accounting, so GPU usage now weighs appropriately in fair-share
calculations.
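You can sanity-check the weights the slots end up advertising with
condor_status's -format option (assuming SlotWeight appears in the
machine ads, as it does for us):
condor_status -format "%s " Name -format "%v\n" SlotWeight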
So that's my two cents. I'd be really interested to hear what other
people are doing.
Vlad