[Condor-users] RFC: Adding GPUs into Condor
- Date: Fri, 25 Mar 2011 19:44:08 +0100
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: [Condor-users] RFC: Adding GPUs into Condor
Hi all,
I'd like to get a few comments on how I boldly went ahead and added GPUs to
Condor. I know several places have already done this, but I have not found a
working "recipe" out there so far. A note of caution: this is still work in
progress, as I still need to figure out how to fit this into the framework of
dynamic slots.
This is the current extra configuration of a box with 4 CPU cores and 4 GPUs
(beware of long lines):
# start only standard universe jobs or jobs which ask for a GPU
# ("+WantGPU" in the submit file)
START = ( JobUniverse =?= 1 || target.WantGPU =?= true )
# rank GPU jobs much higher to kick out non-GPU jobs
RANK = ( target.WantGPU =?= true ) * 10000000
# these settings are added by a CUDA program identifying the available cards
STARTD_ATTRS = GPU_DEV, GPU_NAME, GPU_CAPABILITY, GPU_GLOBALMEM_MB, \
GPU_MULTIPROC, GPU_NUMCORES, GPU_CLOCK_GHZ, GPU_CUDA_DRV, \
GPU_CUDA_RUN
SLOT1_GPU_CUDA_DRV=3.20
SLOT1_GPU_CUDA_RUN=3.20
SLOT1_GPU_DEV=0
SLOT1_GPU_NAME="Tesla C2050"
SLOT1_GPU_CAPABILITY=2.0
SLOT1_GPU_GLOBALMEM_MB=2687
SLOT1_GPU_MULTIPROC=14
SLOT1_GPU_NUMCORES=32
SLOT1_GPU_CLOCK_GHZ=1.15
SLOT2_GPU_CUDA_DRV=3.20
SLOT2_GPU_CUDA_RUN=3.20
SLOT2_GPU_DEV=1
SLOT2_GPU_NAME="Tesla C2050"
SLOT2_GPU_CAPABILITY=2.0
SLOT2_GPU_GLOBALMEM_MB=2687
SLOT2_GPU_MULTIPROC=14
SLOT2_GPU_NUMCORES=32
SLOT2_GPU_CLOCK_GHZ=1.15
SLOT3_GPU_CUDA_DRV=3.20
SLOT3_GPU_CUDA_RUN=3.20
SLOT3_GPU_DEV=2
SLOT3_GPU_NAME="Tesla C2050"
SLOT3_GPU_CAPABILITY=2.0
SLOT3_GPU_GLOBALMEM_MB=2687
SLOT3_GPU_MULTIPROC=14
SLOT3_GPU_NUMCORES=32
SLOT3_GPU_CLOCK_GHZ=1.15
SLOT4_GPU_CUDA_DRV=3.20
SLOT4_GPU_CUDA_RUN=3.20
SLOT4_GPU_DEV=3
SLOT4_GPU_NAME="Tesla C2050"
SLOT4_GPU_CAPABILITY=2.0
SLOT4_GPU_GLOBALMEM_MB=2687
SLOT4_GPU_MULTIPROC=14
SLOT4_GPU_NUMCORES=32
SLOT4_GPU_CLOCK_GHZ=1.15
Most of this configuration is added to Condor's config during boot-up, when a
local service runs the attached program to query the available cards.
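The boot-time step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `write_gpu_config` is hypothetical and nothing here is taken from the actual local service. It keeps only lines of the form `NAME = value`, so diagnostic messages from the query program (for example "There is no device supporting CUDA") do not end up in the Condor configuration.

```shell
#!/bin/sh
# Sketch of a boot-time helper; the function name is hypothetical.
# $1: command that prints the GPU settings (e.g. the attached query program)
# $2: Condor config file to (re)write
write_gpu_config() {
    tmp="$2.tmp.$$"
    # Keep only "NAME = value" lines so that diagnostic messages from the
    # query program never reach the Condor configuration.
    if "$1" | grep -E '^[A-Za-z_][A-Za-z0-9_]*[[:space:]]*=' > "$tmp"; then
        mv "$tmp" "$2"      # replace the old file only when settings were found
    else
        rm -f "$tmp"        # nothing usable: leave the old file untouched
        return 1
    fi
}
```

A boot script would call it with two site-specific paths, e.g. `write_gpu_config /path/to/query_program /path/to/gpu_config_file` (both placeholders), before Condor starts, with the resulting file being read as part of the local configuration (for instance via `LOCAL_CONFIG_FILE`).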
A typical submit file might look like this:
Executable = matrixmult.sh
Arguments = $$(GPU_DEV)
# these variables are available on a per slot basis
# $$(GPU_NAME) $$(GPU_CAPABILITY) $$(GPU_GLOBALMEM_MB) $$(GPU_MULTIPROC)
# $$(GPU_NUMCORES) $$(GPU_CLOCK_GHZ) $$(GPU_CUDA_DRV) $$(GPU_CUDA_RUN)
Error = logs/err.$(Process)
Output = logs/log.$(Process)
Log = gpu-local.log
+WantGPU=True
Universe = vanilla
Queue 10
with matrixmult.sh being
#!/bin/sh
DEVID=$1
# get the cuda environment on our cluster
. /usr/local/nvidia/sdk-3.2/setup.sh
/usr/local/nvidia/sdk-3.2/C/bin/linux/release/matrixMul --noprompt --device=$DEVID
As you can see here, this is something very specific to our local systems.
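Since the wrapper passes its first argument straight to `--device=`, a small guard against an empty or non-numeric value may be worthwhile. The following is a minimal sketch, not part of the original script; the helper name `is_device_id` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a non-empty string of digits,
# i.e. something that can plausibly be a CUDA device index.
is_device_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

# Possible use at the top of a wrapper such as matrixmult.sh:
#   is_device_id "$1" || { echo "bad GPU device id: '$1'" >&2; exit 1; }
```

With such a check the job fails early with a clear message in its error file instead of handing an unexpected value to the CUDA binary.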
My questions:
* Is there a better way to do it?
* This is tailored to our Nvidia cards (we don't have any AMD ones in the
cluster environment so far), so a similar beast needs to be slaughtered for
AMD GPUs, or ideally a fully OpenCL'ed version :)
* Any other comments?
Cheers
Carsten
/*
* Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/
/* This program was derived from the 3.2 SDK version of
* deviceQuery.cpp by Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> and
* hence includes original source code from NVIDIA. I hope to comply
* with the EULA by placing this statement here:
* "This software contains source code provided by NVIDIA Corporation."
* My personal changes/additions to the original code are hereby placed in
* the public domain.
* No warranty is attached to this code at all.
*/
// utilities and system includes
#include <shrUtils.h>
// CUDA-C includes
#include <cuda_runtime_api.h>
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, const char** argv)
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
        shrLog("cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.\n");
        shrLog("\nFAILED\n");
        shrEXIT(argc, argv);
    }
    // This function call returns 0 if there are no CUDA capable devices.
    if (deviceCount == 0)
        shrLog("There is no device supporting CUDA\n");
    // List the attributes the startd should advertise; each name appears once.
    shrLog("STARTD_ATTRS = GPU_DEV, GPU_NAME, GPU_CAPABILITY, GPU_GLOBALMEM_MB, GPU_MULTIPROC, GPU_NUMCORES, GPU_CLOCK_GHZ");
#if CUDART_VERSION >= 2020
    shrLog(", GPU_CUDA_DRV, GPU_CUDA_RUN");
#endif
    shrLog("\n");
    // Condor slots are numbered from 1, CUDA devices from 0:
    // slot "dev" is paired with CUDA device "dev-1".
    int dev;
    for (dev = 1; dev <= deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        int driverVersion=0, runtimeVersion=0;
        cudaGetDeviceProperties(&deviceProp, dev-1);
#if CUDART_VERSION >= 2020
        cudaDriverGetVersion(&driverVersion);
        cudaRuntimeGetVersion(&runtimeVersion);
        shrLog("SLOT%d_GPU_CUDA_DRV=%d.%d\n", dev, driverVersion/1000, driverVersion%100);
        shrLog("SLOT%d_GPU_CUDA_RUN=%d.%d\n", dev, runtimeVersion/1000, runtimeVersion%100);
#endif
        shrLog("SLOT%d_GPU_DEV=%d\n", dev, dev-1);
        shrLog("SLOT%d_GPU_NAME=\"%s\"\n", dev, deviceProp.name);
        shrLog("SLOT%d_GPU_CAPABILITY=%d.%d\n", dev, deviceProp.major, deviceProp.minor);
        // totalGlobalMem is in bytes
        shrLog("SLOT%d_GPU_GLOBALMEM_MB=%.0f\n", dev, deviceProp.totalGlobalMem/(1024.*1024.));
#if CUDART_VERSION >= 2000
        shrLog("SLOT%d_GPU_MULTIPROC=%d\n", dev, deviceProp.multiProcessorCount);
        shrLog("SLOT%d_GPU_NUMCORES=%d\n", dev, ConvertSMVer2Cores(deviceProp.major, deviceProp.minor));
#endif
        // clockRate is in kHz
        shrLog("SLOT%d_GPU_CLOCK_GHZ=%.2f\n", dev, deviceProp.clockRate * 1e-6f);
    }
    // csv masterlog info
    // *****************************
    // exe and CUDA driver name
    std::string sProfileString = "deviceQuery, CUDA Driver = CUDART";
    // shrLogEx(LOGBOTH | MASTER, 0, sProfileString.c_str());
    return 0;
}
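A note on the version strings printed by the program above: CUDA reports its driver and runtime versions as a single integer encoded as 1000*major + 10*minor, and the `SLOT%d_GPU_CUDA_DRV` / `SLOT%d_GPU_CUDA_RUN` lines print `version/1000` and `version%100`. The arithmetic can be reproduced in a few lines of shell; the function names below are made up for illustration only.

```shell
#!/bin/sh
# $1: version integer as returned by cudaDriverGetVersion() or
#     cudaRuntimeGetVersion(), encoded as 1000*major + 10*minor

# Same arithmetic as the program above.
cuda_version_as_printed() {
    echo "$(( $1 / 1000 )).$(( $1 % 100 ))"
}

# Alternative that recovers the plain major.minor pair.
cuda_version_major_minor() {
    echo "$(( $1 / 1000 )).$(( ($1 % 100) / 10 ))"
}

cuda_version_as_printed 3020    # prints 3.20
cuda_version_major_minor 3020   # prints 3.2
```

This is why a CUDA 3.2 installation shows up as 3.20 in the configuration shown earlier.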