[HTCondor-users] requiring gpu through a HTCondor-CE
- Date: Mon, 25 Mar 2019 18:35:51 +0100
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: [HTCondor-users] requiring gpu through a HTCondor-CE
Hello,
I have an exec node equipped with two GPUs:
[root@hpc-200-06-07 ~]# /usr/libexec/condor/condor_gpu_discovery -properties
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K40m"
CUDADriverVersion=10.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=11441
CUDA0DevicePciBusId="0000:***"
CUDA0DeviceUuid="0caa****"
CUDA1DevicePciBusId="0000:***"
CUDA1DeviceUuid="158****"
The host can be identified through requirements:
[root@ce02-htc ~]# condor_status -constraint '((CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m")) && (Arch == "X86_64") && (OpSys == "LINUX")'
Name                   OpSys  Arch   State     Activity LoadAv Mem    ActvtyTime
slot1@wn-01-02-03****  LINUX  X86_64 Unclaimed Idle     0.000  128737 3+01:23:46
              Machines Owner Claimed Unclaimed Matched Preempting Drain
 X86_64/LINUX        1     0       0         1       0          0     0
        Total        1     0       0         1       0          0     0
Direct submission to condor from the CE host works, using the following
submit file:
[sdalpra@ce02-htc htjobs]$ cat ce_testp308_gpu.sub
universe = vanilla
request_GPUs = 1
requirements = (CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m") && $(requirements:True)
executable = parrec_K40/parrec
output = parrec.out
error = parrec.err
log = parrec.log
arguments = "400 400 16 32 16"
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
transfer_input_files = parrec_K40/sinos_400.sdt, parrec_K40/sinos_400.spr, parrec_K40/sinos.sct
transfer_output_files = sinos_400.sdt, sinos_400.spr, sinos.sct
queue
###########################
Submission to the HTCondor-CE succeeds:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308_gpu.sub
Submitting job(s).
1 job(s) submitted to cluster 2953.
using this submit file:
[sdalpra@ui-htc htjobs]$ cat ce_testp308_gpu.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined
request_GPUs = 1
requirements = (TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m")
[.... the rest is the same...]
However, the requirements are overridden by the set_requirements entry
in the routing table:
JOB_ROUTER_ENTRIES @=jre
[
        name = "condor_pool_dteam";
        TargetUniverse = 5;
        Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
        set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
        MaxJobs = 100;
        MaxIdleJobs = 100;
]
By inspecting JOB_ROUTER_DEFAULTS it seems that the original
requirements are being overwritten anyway:
[...] set_requirements = True [...]
Tracking a job submitted to the CE shows the difference:
[root@ce02-htc ~]# condor_ce_q -l 2917. -af RoutedToJobId requirements
ClusterId = 2917
ProcId = 0
requirements = ((TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName
=?= "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys ==
"LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >=
RequestMemory) && (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)
RoutedToJobId = "2548.0"
[root@ce02-htc ~]# condor_history -l 2548.0 -af RoutedFromJobId requirements
RoutedFromJobId = "2917.0"
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")
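To make the difference between the two ads explicit, here is a small Python sketch that lists which clauses of the CE job's requirements are missing from the routed job. It is only an illustration and not part of the actual setup: the helper names (strip_outer_parens, conjuncts) are hypothetical, and the parsing is deliberately naive (it splits on `&&` outside parentheses and assumes no parentheses or `&&` occur inside quoted strings).

```python
def strip_outer_parens(expr):
    """Remove parentheses that wrap the whole expression, e.g. '((a))' -> 'a'."""
    expr = expr.strip()
    while expr.startswith("(") and expr.endswith(")"):
        depth = 0
        for i, ch in enumerate(expr):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(expr) - 1:
                    # The first '(' closes before the end: not a wrapping pair.
                    return expr
        expr = expr[1:-1].strip()
    return expr


def conjuncts(expr):
    """Flatten an expression into the list of clauses joined by '&&'."""
    expr = strip_outer_parens(expr)
    parts, depth, start, i = [], 0, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and expr.startswith("&&", i):
            parts.append(expr[start:i])
            i += 2
            start = i
            continue
        i += 1
    parts.append(expr[start:])
    if len(parts) == 1:
        return [expr]
    flat = []
    for part in parts:
        flat.extend(conjuncts(part))
    return flat


# Requirements of the CE job (2917.0) and of the routed job (2548.0),
# copied from the condor_ce_q / condor_history output above.
ce_requirements = (
    '((TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m")) '
    '&& (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") '
    '&& (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) '
    '&& (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)'
)
routed_requirements = '(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")'

kept = conjuncts(routed_requirements)
dropped = [c for c in conjuncts(ce_requirements) if c not in kept]

for clause in dropped:
    print("dropped:", clause)
```

Only the two clauses from set_requirements survive in the routed job; both CUDA clauses from the submit file are among those dropped.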
I made several attempts to have the requirements in the submit file
routed from the CE to condor, but have not found a successful way so far.
Is it at all possible?
Any inspiring example?
Thank You
Stefano