[HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
- Date: Mon, 23 Nov 2020 09:56:46 +0000
- From: jcaballero.hep@xxxxxxxxx
- Subject: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
Hi,
This is a follow-up to the previous thread "Multiple
JOB_TRANSFORMATION blocks not working on a schedd".
I think I fixed the configuration on the schedd, but matchmaking is
still not working the way I need.
I am trying to make the matchmaker force a certain type of job
(those from the lhcb VO) to run only on a selected set of machines.
To do this, I have added a custom ClassAd attribute to those
machines, as in [1],
and I am trying to add that attribute to the Requirements expression of
these jobs via JOB_TRANSFORM [2], as we discussed in the other
thread.
Jobs look like this [3].
However, those jobs are not being picked up.
Those machines are currently busy running other jobs,
but I thought that would be indicated in the output of condor_q
-analyze, with something like
    N match but are serving other users
That is not the case [4].
It reports that 106 slots matched the job's Requirements, and yet no
successful match is recorded.
Adding the -reverse -machine options seems to indicate that the problem
is that the job does not meet some requirement of the machine [5].
That surprises me a little, since the second JOB_TRANSFORM block does
not remove or overwrite any job attribute except Requirements.
In fact, the same command on a production schedd, against a job that is
currently running, also gives me zeros in the "Slot's Req Matches Job"
column.
So I may be misunderstanding what that column means.
Any tips on how to troubleshoot this lack of matching are more than welcome.
Thanks a lot in advance.
Cheers,
Jose
======================================================================
[1]
[root@machine01 ~]# condor_config_val -dump | grep STARTD_ATTR
STARTD_ATTRS = <... other attributes...>, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs, EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
SYSTEM_STARTD_ATTRS = COLLECTOR_HOST_STRING DedicatedScheduler
[root@lcg1863 config.d]# condor_config_val ONLY_LHCB
True
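For reference, output like the above is typically produced by startd configuration along the following lines. This is a minimal sketch, not the actual file from machine01: only the attribute name ONLY_LHCB, its value True, and its presence in STARTD_ATTRS come from the output above; everything else is illustrative.

```
# Define a custom knob on the selected machines (value as reported above).
ONLY_LHCB = True

# Advertise it in the slot ClassAds by appending to the existing list,
# so that attributes already present in STARTD_ATTRS are preserved.
STARTD_ATTRS = $(STARTD_ATTRS) ONLY_LHCB
```

On machines that do not define ONLY_LHCB, a job requirement of TARGET.ONLY_LHCB evaluates to undefined rather than true, so those machines are not considered a match.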
======================================================================
[2]
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES), DefaultDocker, forcelhcb
JOB_TRANSFORM_DefaultDocker @=end
[
    Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner =!= "nagios";
    set_WantDocker = true;
    set_Requirements = ( TARGET.HasDocker ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer ) && ( x509UserProxyVOName =?= "atlas" && NumJobStarts == 0 || x509UserProxyVOName =!= "atlas");
    copy_TransferInput = "OriginalTransferInput";
    eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
    set_PeriodicRemove = ( (RemoteUserCpu + RemoteSysCpu > JobCpuLimit) ?: False ) || ( (RemoteWallClockTime > JobTimeLimit) ?: False )
]
@end
JOB_TRANSFORM_forcelhcb @=end
[
    Requirements = JobUniverse == 5 && x509UserProxyVOName == "lhcb" && ScheddHostName == "ce-test";
    set_Requirements = TARGET.ONLY_LHCB;
]
@end
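Note that set_Requirements in forcelhcb replaces the whole expression, including any clauses set by an earlier transform such as DefaultDocker, which is consistent with the job in [3] showing only TARGET.ONLY_LHCB. If the intent is to keep those clauses and add the new one, a possible variant is sketched below. It reuses the copy_ pattern that DefaultDocker already applies to TransferInput; the attribute name OriginalRequirements is hypothetical, and the sketch is untested.

```
# Assumption: OriginalRequirements is a made-up attribute name used to
# hold the previous expression. copy_ saves the current Requirements
# under that name; set_ then builds a new expression that keeps the old
# clauses and adds the machine attribute. "=?= True" makes a slot
# without ONLY_LHCB evaluate to false instead of undefined.
JOB_TRANSFORM_forcelhcb @=end
[
    Requirements = JobUniverse == 5 && x509UserProxyVOName == "lhcb" && ScheddHostName == "ce-test";
    copy_Requirements = "OriginalRequirements";
    set_Requirements = ( OriginalRequirements ) && ( TARGET.ONLY_LHCB =?= True );
]
@end
```

This does not by itself explain the zeros in [5]; it only avoids discarding the resource requirements that were already part of the job.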
======================================================================
[3]
[root@ce-test ~]# condor_q -l 2399.0 | grep ^Requirements
Requirements = TARGET.ONLY_LHCB
======================================================================
[4]
[root@ce-test ~]# condor_q -better-analyze 2399.0
-- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
The Requirements expression for job 2399.000 is
TARGET.ONLY_LHCB
Job 2399.000 defines the following attributes:
The Requirements expression for job 2399.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         106  TARGET.ONLY_LHCB
No successful match recorded.
Last failed match: Mon Nov 23 08:56:10 2020
Reason for last match failure: no match found
2399.000:  Run analysis summary ignoring user priority.  Of 15795 machines,
  15126 are rejected by your job's requirements
    104 reject your job because of their own requirements
    565 are exhausted partitionable slots
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
WARNING:  Be advised:
   Job did not match any machines's constraints
   To see why, pick a machine that you think should match and add
     -reverse -machine <name>
   to your query.
======================================================================
[5]
[root@ce-test ~]# condor_q -analyze -reverse -machine machine01 2399.0
-- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
2399.0: Analyzing matches for 1 job
                    Slot   Slot's Req    Job's Req      Both
Name                Type   Matches Job   Matches Slot   Match %
------------------- -----  ------------  ------------   ----------
slot1@machine01     Part   0             1              0.00
slot1_10@machine01  Dyn    0             1              0.00
slot1_11@machine01  Dyn    0             1              0.00
slot1_12@machine01  Dyn    0             1              0.00
slot1_14@machine01  Dyn    0             1              0.00
slot1_15@machine01  Dyn    0             1              0.00
slot1_16@machine01  Dyn    0             1              0.00
slot1_18@machine01  Dyn    0             1              0.00
slot1_19@machine01  Dyn    0             1              0.00
slot1_1@machine01   Dyn    0             1              0.00
slot1_20@machine01  Dyn    0             1              0.00
slot1_21@machine01  Dyn    0             1              0.00
slot1_22@machine01  Dyn    0             1              0.00
slot1_24@machine01  Dyn    0             1              0.00
slot1_26@machine01  Dyn    0             1              0.00
slot1_29@machine01  Dyn    0             1              0.00
slot1_2@machine01   Dyn    0             1              0.00
slot1_30@machine01  Dyn    0             1              0.00
slot1_32@machine01  Dyn    0             1              0.00
slot1_3@machine01   Dyn    0             1              0.00
slot1_5@machine01   Dyn    0             1              0.00
slot1_9@machine01   Dyn    0             1              0.00