Re: [HTCondor-users] Why jobs are not being picked anymore?
- Date: Thu, 11 Jun 2020 20:54:19 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Why jobs are not being picked anymore?
I'm guessing that node health is your problem. Look at the bottom of the -reverse output [5]:
[0] 0 NODE_IS_HEALTHY is true
It says that 0 of the job clusters analyzed against this slot satisfy that clause of the slot's requirements.
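One way to see why that clause fails, using only the attributes printed in [5]: the slot defines NODE_IS_HEALTHY as
true && true && ( WantEchoXrootd =?= false || WantEchoXrootd =?= undefined ), and the job ad has WantEchoXrootd = true.
The sketch below is plain Python, not HTCondor code. The names UNDEFINED, meta_equals and node_is_healthy are made up
for illustration, and it assumes the unscoped WantEchoXrootd reference resolves to the job's attribute during matchmaking.

```python
# Rough emulation of the slot's NODE_IS_HEALTHY expression from [5].
# Assumption: "=?=" is modelled as "same type and same value", with
# undefined only identical to undefined.

UNDEFINED = object()  # stand-in for the ClassAd "undefined" value


def meta_equals(a, b):
    """Approximate ClassAd '=?=': never yields undefined, only True or False."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def node_is_healthy(job_ad):
    """Evaluate the expression against a job ad given as a plain dict."""
    want = job_ad.get("WantEchoXrootd", UNDEFINED)
    return True and True and (
        meta_equals(want, False) or meta_equals(want, UNDEFINED)
    )


# Job 6184050.0 as shown in [5]: WantEchoXrootd is true.
print(node_is_healthy({"WantEchoXrootd": True}))   # False
# A job that does not set the attribute at all would pass this clause.
print(node_is_healthy({}))                         # True
```

If that reading is right, the slot's START expression rejects any job that sets WantEchoXrootd to true, regardless of ONLY_LHCB.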
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
Sent: Tuesday, June 9, 2020 10:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Why jobs are not being picked anymore?
Hi,
I have another one of my weird questions :)
Here is the situation:
* I want certain jobs on a given schedd to run on a given set of
machines, and I don't want those machines to run anything else.
In order to achieve that, I have this configuration on the startds [1]
and this on the schedd [2].
* Indeed, I can see the requirement in the IDLE jobs [3].
* It has been working fine for a while. However, since a few days ago,
jobs stay IDLE forever.
As far as I understand, better-analyze claims there are no hosts that
could run the jobs [4].
Do you see any clue in the output of [5] that could help me to
understand why they don't run?
Is there anything I could check on the Central Manager to troubleshoot?
Thanks a lot in advance.
Cheers,
Jose
[1]
[root @ startd ~]# rpm -qa | grep condor
condor-8.6.13-1.el7.x86_64
condor-classads-8.6.13-1.el7.x86_64
condor-procd-8.6.13-1.el7.x86_64
condor-external-libs-8.6.13-1.el7.x86_64
condor-python-8.6.13-1.el7.x86_64
tier1-condor-wn-healthcheck-1.10-1.x86_64
mjf-htcondor-00.14-1.noarch
tier1-condor-docker-1.6.4-1.noarch
[root @ startd ~]# condor_config_val STARTD_ATTRS
RalCluster, RalSnapshot, RalBranchName, RalBranchType, ScalingFactor,
StartJobs, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs,
EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
[root @ startd ~]# condor_config_val ONLY_LHCB
True
[2]
[root@schedd ~]# rpm -qa | grep condor
condor-procd-8.6.13-1.el6.x86_64
condor-classads-8.6.13-1.el6.x86_64
condor-python-8.6.13-1.el6.x86_64
condor-external-libs-8.6.13-1.el6.x86_64
condor-8.6.13-1.el6.x86_64
[root@schedd ~]# condor_config_val JOB_TRANSFORM_NAMES
, forcelhcb, DefaultDocker
[root@schedd ~]# condor_config_val JOB_TRANSFORM_forcelhcb
[
Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner
=!= "nagios" && x509UserProxyVOName == "lhcb";
set_Transformed = "forcelhcb";
set_WantDocker = true;
eval_set_DockerImage = ifThenElse(NordugridQueue =?= "EL7",
"stfc/grid-workernode-c7:2019-07-02.1",
"stfc/grid-workernode-c6:2019-07-02.1");
set_Requirements = TARGET.ONLY_LHCB;
copy_TransferInput = "OriginalTransferInput";
eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
]
[3]
[root@schedd ~]# condor_q 6184050.0 -format '%s\n' Requirements
TARGET.ONLY_LHCB
[4]
[root@schedd ~]# condor_q -better-analyze 6184050.0
-- Schedd: xxxxx : <130.246.182.180:17549>
The Requirements expression for job 6184050.000 is
TARGET.ONLY_LHCB
Job 6184050.000 defines the following attributes:
The Requirements expression for job 6184050.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 6 TARGET.ONLY_LHCB
6184050.000: Run analysis summary ignoring user priority. Of 13097 machines,
12529 are rejected by your job's requirements
6 reject your job because of their own requirements
562 are exhausted partitionable slots
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
WARNING: Be advised:
Job did not match any machines's constraints
To see why, pick a machine that you think should match and add
-reverse -machine <name>
to your query.
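A quick sanity check on the summary in [4], as a sketch in plain Python with the counts copied from the output above.
The dictionary keys are paraphrases, not HTCondor attribute names. It only confirms that the categories add up to the
pool size. That the 6 slots matching TARGET.ONLY_LHCB are the same 6 that reject the job is a plausible reading, not
something the output states.

```python
# Counts copied from the "Run analysis summary" in [4].
total_machines = 13097
summary = {
    "rejected by the job's requirements": 12529,
    "reject the job because of their own requirements": 6,
    "exhausted partitionable slots": 562,
    "match and already running the user's jobs": 0,
    "match but serving other users": 0,
    "available to run the job": 0,
}

# The categories should account for every machine in the pool.
accounted = sum(summary.values())
print(accounted == total_machines)

# Step [0] reported 6 slots matching TARGET.ONLY_LHCB, the same number as
# the slots rejecting the job through their own requirements.
slots_matching_job = 6
same_count = (
    slots_matching_job
    == summary["reject the job because of their own requirements"]
)
print(same_count)
```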
[5]
[root@schedd ~]# condor_q -better-analyze 6184050.0 -reverse -machine <startd>
-- Schedd: <schedd> : <130.246.182.180:17549>
-- Slot: slot1@<startd> : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
( START ) && ( IsValidCheckpointPlatform ) &&
( WithinResourceLimits )
START is
( NODE_IS_HEALTHY is true ) &&
( StartJobs is true ) && ( RecentJobStarts < 20 ) &&
( x509UserProxyVOName is "lhcb" ) &&
( ScheddHostName is "<schedd>" ) &&
( ( UtsnameRelease is "5.3.1-1.el7.elrepo.x86_64" ) ||
( x509UserProxyVOName is "lsst" ) ) &&
ifThenElse(Offline is undefined,true,( ( CurrentTime -
QDate ) >= 900 )) &&
ifThenElse(false,isPreemptable is true,true) &&
( false == false )
IsValidCheckpointPlatform is
( TARGET.JobUniverse isnt 1 ||
( ( MY.CheckpointPlatform isnt undefined ) &&
( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
( TARGET.NumCkpts == 0 ) ) ) )
WithinResourceLimits is
( ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
TARGET._condor_RequestCpus <=
MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 &&
TARGET._condor_RequestMemory <=
MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0
&&
TARGET.RequestMemory <= MY.Memory,false)) &&
ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
TARGET._condor_RequestDisk <=
MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk,false)) )
This slot defines the following attributes:
CheckpointPlatform = "LINUX X86_64 5.3.1-1.el7.elrepo.x86_64
normal N/A avx avx2 ssse3 sse4_1 sse4_2"
Cpus = 128
Disk = 352654212
Memory = 696600
NODE_IS_HEALTHY = true && true && ( WantEchoXrootd =?= false ||
WantEchoXrootd =?= undefined )
RecentJobStarts = 0
StartJobs = true
UtsnameRelease = "5.3.1-1.el7.elrepo.x86_64"
Job 6184050.0 has the following attributes:
TARGET.QDate = 1591704386
TARGET.ScheddHostName = "<schedd>"
TARGET.JobUniverse = 5
TARGET.NumCkpts = 0
TARGET.RequestCpus = 1
TARGET.RequestDisk = 75
TARGET.RequestMemory = 4000
TARGET.WantEchoXrootd = true
TARGET.x509UserProxyVOName = "lhcb"
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[0] 0 NODE_IS_HEALTHY is true
slot1@<startd>: Run analysis summary of 1 jobs.
0 (0.00 %) match both slot and job requirements.
0 match the requirements of this slot.
1 have job requirements that match this slot.