[HTCondor-users] HTCondor-CE on slurm
- Date: Mon, 17 Jun 2019 17:17:16 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: [HTCondor-users] HTCondor-CE on slurm
Hello,
I'm trying to set up an HTCondor-CE instance on top of a Slurm batch system.
[sdalpra0@r000u11l06-fe condor]$ rpm -qa | grep htcondor
htcondor-ce-3.2.2-1.el7.noarch
htcondor-ce-slurm-3.2.2-1.el7.noarch
htcondor-ce-client-3.2.2-1.el7.noarch
I am testing it as a dteam VO member, with the following Job Router rules:
JOB_ROUTER_ENTRIES @=jre
[
        name = "condor_pool_dteam";
        GridResource = "batch slurm";
        TargetUniverse = 9;
        Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
        MaxJobs = 100;
        MaxIdleJobs = 100;
]
[
        name = "condor_pool_cms";
        GridResource = "batch slurm";
        TargetUniverse = 9;
        Requirements = target.x509UserProxyVOName =?= "cms";
        MaxJobs = 1280;
        MaxIdleJobs = 1280;
]
 @jre
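As an aside for readers comparing the two routes: they select jobs in different ways. The sketch below is only a rough Python analogy. It assumes that the ClassAd regexp() function performs an unanchored search and that =?= is an exact, case-sensitive comparison that is simply false when the attribute is undefined. The function names and the sample VO strings are hypothetical.

```python
import re


def dteam_route_matches(vo_name):
    """Analogy for regexp("dteam", TARGET.x509UserProxyVoName).

    Assumption: the match is unanchored, so any VO name that
    contains "dteam" is accepted. An undefined attribute (None
    here) is treated as no match.
    """
    return vo_name is not None and re.search("dteam", vo_name) is not None


def cms_route_matches(vo_name):
    """Analogy for target.x509UserProxyVOName =?= "cms".

    Only the exact string "cms" is accepted; None (undefined)
    compares as False.
    """
    return vo_name == "cms"


if __name__ == "__main__":
    # Hypothetical VO names, used only to show the difference.
    for vo in ("dteam", "dteam.example", "cms", "cms.example", None):
        print(vo, dteam_route_matches(vo), cms_route_matches(vo))
```

Under these assumptions the dteam route is the more permissive of the two, since it accepts any VO name containing the substring.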
A job submitted to the CE appears to be routed as far as the submission
step, and then it disappears.
JobRouterLog says:
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter
(src=18.0,dest=19.0,route=condor_pool_dteam): submitted job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter
(src=18.0,dest=19.0,route=condor_pool_dteam): submitted job has not yet
appeared in job queue mirror or was removed (submitted 0 seconds ago)
(The complete JobRouterLog chunk for this transaction is copied below.)
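To follow a single job through a long JobRouterLog, the per-job lines can be filtered out by their "(src=...,dest=...,route=...)" tag. The following is a minimal Python sketch, assuming the log format shown in this message; the function name job_events is hypothetical, and the code re-joins lines that were wrapped by the mail client.

```python
import re

# A new log record starts at a "MM/DD/YY HH:MM:SS " timestamp.
# Splitting on a zero-width lookahead requires Python 3.7 or later.
RECORD_START = re.compile(r"(?=\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} )")

# Per-job tag written by JobRouter; "dest" is absent before submission.
JOB_TAG = re.compile(
    r"\(src=(?P<src>\d+\.\d+)"
    r"(?:,dest=(?P<dest>\d+\.\d+))?"
    r",route=(?P<route>[^)]+)\):\s*(?P<msg>.*)"
)


def job_events(log_text):
    """Return a list of (src, dest, route, message) tuples."""
    events = []
    for record in RECORD_START.split(log_text):
        # Undo line wrapping by collapsing all whitespace runs.
        flat = " ".join(record.split())
        match = JOB_TAG.search(flat)
        if match:
            events.append(
                (match["src"], match["dest"], match["route"], match["msg"])
            )
    return events


if __name__ == "__main__":
    sample = (
        "06/17/19 15:23:56 (D_ALWAYS:2) JobRouter\n"
        "(src=18.0,route=condor_pool_dteam): claimed job\n"
        "06/17/19 15:23:56 (D_ALWAYS:2) JobRouter\n"
        "(src=18.0,dest=19.0,route=condor_pool_dteam): submitted job\n"
    )
    for event in job_events(sample):
        print(event)
```

Lines that carry only a route tag, such as the "Setting attribute" messages, are skipped on purpose because they are not tied to a source job id.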
AFAIK the current HTCondor-CE version no longer depends on a blahp rpm.
I've found this shell script, /usr/libexec/condor/glite/bin/slurm_submit.sh,
from the condor-8.8.2 rpm,
but I'm not sure whether it is actually invoked anywhere.
I would appreciate some guidance on how to troubleshoot this:
- How can I see what Slurm submission command is generated?
(I added a cp $bls_tmp_file /tmp/copia_${bls_tmp_file} to see the Slurm
submit file, but no file is created,
so I doubt this script is executed at all.)
- How do I specify the partition name in the submit file (and a few of the
most common Slurm options)?
Do you have a simple example submit file for Slurm?
My submit file is:
[sdalpra@ui-htc slurm_cn]$ cat testp308.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined
# Files
executable = p308/htcp308
output = htcp308.out
error = htcp308.err
log = htcp308.log
arguments = "0 0 1 1001"
# File transfer behavior
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
transfer_output_files = htcp308.err, htcp308.out
queue
#########
Thanks,
Stefano
06/17/19 15:23:56 (D_ALWAYS:2) === Current Probing Information ===
06/17/19 15:23:56 (D_ALWAYS:2) fsize: 5111              mtime: 1560777826
06/17/19 15:23:56 (D_ALWAYS:2) first log entry: 7 CreationTimestamp
1559046390
06/17/19 15:23:56 (D_ALWAYS) JobRouter: Checking for candidate jobs.
routing table is:
Route Name          Submitted/Max   Idle/Max   Throttle   Recent: Started Succeeded Failed
condor_pool_cms          0/ 1280    0/ 1280    none                     0         0      0
condor_pool_dteam        0/  100    0/  100    none                     0         0      0
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter: Umbrella constraint:
((target.x509userproxysubject =!= UNDEFINED) &&
(target.x509UserProxyExpiration =!= UNDEFINED) && (time() <
target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 ||
target.JobUniverse =?= 1)) && ( (target.x509UserProxyVOName is "cms") ||
((regexp("dteam",TARGET.x509UserProxyVoName))) ) && (target.ProcId >= 0
&& target.JobStatus == 1 && (target.StageInStart is undefined ||
target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone"
&& target.Managed isnt "External" && target.Owner isnt Undefined &&
target.RoutedBy isnt "htcondor-ce")
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter: Found candidate job
src=18.0,route=condor_pool_dteam
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request
to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter
(src=18.0,route=condor_pool_dteam): claimed job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Copying attribute RequestCpus to orig_RequestCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Copying attribute environment to orig_environment
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Copying attribute OnExitHoldSubCode to orig_OnExitHoldSubCode
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Copying attribute OnExitHold to orig_OnExitHold
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Copying attribute OnExitHoldReason to orig_OnExitHoldReason
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Deleting attribute TotalSubmitProcs
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Deleting attribute CondorCE
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Deleting attribute PeriodicRemove
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute JobMemory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute RequestMemory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute JOB_GLIDEIN_Memory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute osg_environment
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute requirements
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute GlideinCpusIsGood
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute OnExitHoldReason
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute OnExitHold
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute JobIsRunning
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute OnExitHoldSubCode
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute RoutedJob
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute RequestCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute JobCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute CondorCECollectorHost
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute JOB_GLIDEIN_Cpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute remote_queue to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute remote_OriginalMemory to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute OriginalMemory to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute environment to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00191ms] Owner --> a07cms04
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.09513ms]
userHome(Owner,"/") --> /marconi/home/usera07cms/a07cms04
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00310ms]
CondorCECollectorHost --> r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms]
orig_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms]
osg_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms]
orig_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.04911ms]
strcat(osg_environment," ",orig_environment) -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.09990ms]
ifThenElse(orig_environment is
undefined,osg_environment,strcat(osg_environment," ",orig_environment)) -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.30994ms]
strcat("HOME=",userHome(Owner,"/"),"
CONDORCE_COLLECTOR_HOST=",CondorCECollectorHost,"
",ifThenElse(orig_environment is
undefined,osg_environment,strcat(osg_environment," ",orig_environment)))
--> HOME=/marconi/home/usera07cms/a07cms04
CONDORCE_COLLECTOR_HOST=r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.35000ms]
strcat("HOME=",userHome(Owner,"/"),"
CONDORCE_COLLECTOR_HOST=",CondorCECollectorHost,"
",ifThenElse(orig_environment is
undefined,osg_environment,strcat(osg_environment," ",orig_environment)))
--> HOME=/marconi/home/usera07cms/a07cms04
CONDORCE_COLLECTOR_HOST=r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute remote_SMPGranularity to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute remote_NodeNumber to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute remote_cerequirements to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam):
Setting attribute OriginalCpus to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request
to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request
to local schedd for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter
(src=18.0,dest=19.0,route=condor_pool_dteam): submitted job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter
(src=18.0,dest=19.0,route=condor_pool_dteam): submitted job has not yet
appeared in job queue mirror or was removed (submitted 0 seconds ago)
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter: polling state of (1) managed jobs.
06/17/19 15:24:06 (D_ALWAYS:2) TimerHandler_JobLogPolling() called
06/17/19 15:24:06 (D_ALWAYS:2) === Current Probing Information ===
06/17/19 15:24:06 (D_ALWAYS:2) fsize: 11084             mtime: 1560777844
06/17/19 15:24:06 (D_ALWAYS:2) first log entry: 7 CreationTimestamp
1559046390
06/17/19 15:24:06 (D_ALWAYS) JobRouter: Checking for candidate jobs.
routing table is:
Route Name          Submitted/Max   Idle/Max   Throttle   Recent: Started Succeeded Failed
condor_pool_cms          0/ 1280    0/ 1280    none                     0         0      0
condor_pool_dteam        1/  100    1/  100    none                     1         0      0
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter: Umbrella constraint:
((target.x509userproxysubject =!= UNDEFINED) &&
(target.x509UserProxyExpiration =!= UNDEFINED) && (time() <
target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 ||
target.JobUniverse =?= 1)) && ( (target.x509UserProxyVOName is "cms") ||
((regexp("dteam",TARGET.x509UserProxyVoName))) ) && (target.ProcId >= 0
&& target.JobStatus == 1 && (target.StageInStart is undefined ||
target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone"
&& target.Managed isnt "External" && target.Owner isnt Undefined &&
target.RoutedBy isnt "htcondor-ce")
06/17/19 15:24:06 (D_ALWAYS:2) SharedPortClient: sent connection request
to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:24:06 (D_ALWAYS:2) Setting RoutedToJobId = "19.0"
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter
(src=18.0,dest=19.0,route=condor_pool_dteam): updated job status
06/17/19 15:24:07 (D_ALWAYS:2) JobRouter: Evaluating all managed jobs
periodic job policy expressions.
06/17/19 15:24:07 (D_ALWAYS:2) JobRouter: Evaluated all managed jobs
periodic expressions.
06/17/19 15:24:16 (D_ALWAYS:2) JobRouter: polling state of (1) managed jobs.
06/17/19 15:24:16 (D_ALWAYS:2) TimerHandler_JobLogPolling() called