Re: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
- Date: Mon, 30 Nov 2020 08:49:49 +0000
- From: jcaballero.hep@xxxxxxxxx
- Subject: Re: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
Hi TJ,
thanks a lot. Quite useful, indeed.
Cheers,
Jose
On Tue, 24 Nov 2020 at 17:25, John M Knoeller
(<johnkn@xxxxxxxxxxx>) wrote:
>
> It's hard to know for sure, but it looks like your ONLY_LHCB slots are just busy running other jobs.
>
> You can see that here
>
> 2399.000: Run analysis summary ignoring user priority. Of
> 15795 machines,
> 15126 are rejected by your job's requirements
> 104 reject your job because of their own requirements
> -->>> 565 are exhausted partitionable slots
>
> And also here
>
> -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
> 2399.0: Analyzing matches for 1 job
>                     Slot  Slot's Req   Job's Req     Both
> Name                Type  Matches Job  Matches Slot  Match %
> ------------------  ----  -----------  ------------  -------
> slot1@machine01     Part  0            1             0.00
> slot1_10@machine01  Dyn   0            1             0.00
> .....
>
> It looks like the Dynamic slots are too small to match, and the p-slot is out of resources so it doesn't match either.
>
> -tj
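
One way to picture the two "Matches" columns: a match needs the slot to accept the job and the job to accept the slot. The sketch below is a toy model in plain Python, not HTCondor code. The resource numbers and the helper names (slot_accepts_job, job_accepts_slot, analyze) are made up for illustration, and the slot-side check is only a stand-in for whatever the real slot Requirements contain.

```python
# Toy model of two-sided matchmaking; all values below are hypothetical.
job = {"RequestCpus": 1, "RequestMemory": 2048}

slots = [
    # Partitionable slot with nothing left to hand out.
    {"Name": "slot1@machine01", "Type": "Part",
     "Cpus": 0, "Memory": 0, "ONLY_LHCB": True},
    # Dynamic slot sized for some other job (assumed smaller than this request).
    {"Name": "slot1_10@machine01", "Type": "Dyn",
     "Cpus": 1, "Memory": 1024, "ONLY_LHCB": True},
]

def slot_accepts_job(slot, job):
    """Stand-in for the slot's side: its resources must cover the request."""
    return (slot["Cpus"] >= job["RequestCpus"]
            and slot["Memory"] >= job["RequestMemory"])

def job_accepts_slot(job, slot):
    """Stand-in for the job's Requirements = TARGET.ONLY_LHCB."""
    return slot.get("ONLY_LHCB") is True

def analyze(job, slots):
    """Return (name, type, slot_ok, job_ok, both) for every slot."""
    rows = []
    for slot in slots:
        slot_ok = slot_accepts_job(slot, job)
        job_ok = job_accepts_slot(job, slot)
        rows.append((slot["Name"], slot["Type"],
                     int(slot_ok), int(job_ok), slot_ok and job_ok))
    return rows

rows = analyze(job, slots)
for row in rows:
    print(row)
```

In this toy both slots advertise ONLY_LHCB, so the job's side is 1, but neither slot has room for the request, so the slot's side is 0 and nothing matches.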
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
> Sent: Monday, November 23, 2020 3:57 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
>
> Hi,
>
> This is a follow-up to the previous thread "Multiple
> JOB_TRANSFORMATION blocks not working on a schedd".
> I think I fixed the configuration at the schedd, but matchmaking is
> still not working the way I need.
>
> I am trying to force a certain type of job (those from the lhcb VO) to
> run only on a selected set of machines.
> To do this, I have added a special ClassAd attribute to those machines,
> as in [1], and I am trying to add that attribute to the Requirements
> expression of these jobs via JOB_TRANSFORM [2], as we discussed in the
> other thread.
> Jobs look like this [3].
>
> However, those jobs are not being picked up.
>
> Those machines are currently busy running other jobs,
> but I thought that would be indicated in the output of condor_q
> -analyze, something like
>
> N match but are serving other users
>
> That is not the case [4].
> It says that 106 slots matched, and yet there is no successful matching.
>
> Adding the -reverse -machine options seems to indicate that the issue is
> that the jobs don't meet some requirements of the machine [5].
> That surprises me a little bit, since I do not remove or overwrite any
> job attribute in the second JOB_TRANSFORM block, except Requirements.
> Indeed, the same command on a production schedd, against a job that is
> currently running, also gives me zeros in the "Slot's Req Matches Job"
> column.
> So I may not be understanding what it means....
>
> Any tip on how to troubleshoot this lack of matching is more than welcome.
>
> Thanks a lot in advance.
> Cheers,
> Jose
>
> ======================================================================
>
> [1]
>
> [root@machine01 ~]# condor_config_val -dump | grep STARTD_ATTR
> STARTD_ATTRS = <... other attributes...>, ShouldHibernate,
> PREEMPTABLE_ONLY, StartJobs, EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
> SYSTEM_STARTD_ATTRS = COLLECTOR_HOST_STRING DedicatedScheduler
>
> [root@lcg1863 config.d]# condor_config_val ONLY_LHCB
> True
>
> ======================================================================
>
> [2]
>
> JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES), DefaultDocker, forcelhcb
>
> JOB_TRANSFORM_DefaultDocker @=end
> [
>     Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner =!= "nagios";
>     set_WantDocker = true;
>     set_Requirements = ( TARGET.HasDocker ) && ( TARGET.Disk >= RequestDisk ) &&
>         ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) &&
>         ( TARGET.HasFileTransfer ) &&
>         ( x509UserProxyVOName =?= "atlas" && NumJobStarts == 0 || x509UserProxyVOName =!= "atlas");
>     copy_TransferInput = "OriginalTransferInput";
>     eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
>     set_PeriodicRemove = ( (RemoteUserCpu + RemoteSysCpu > JobCpuLimit) ?: False ) ||
>         ( (RemoteWallClockTime > JobTimeLimit) ?: False )
> ]
> @end
>
>
> JOB_TRANSFORM_forcelhcb @=end
> [
>     Requirements = JobUniverse == 5 && x509UserProxyVOName == "lhcb" && ScheddHostName == "ce-test";
>     set_Requirements = TARGET.ONLY_LHCB;
> ]
> @end
>
>
> ======================================================================
>
> [3]
>
> [root@ce-test ~]# condor_q -l 2399.0 | grep ^Requirements
> Requirements = TARGET.ONLY_LHCB
>
> ======================================================================
>
> [4]
>
>
> [root@ce-test ~]# condor_q -better-analyze 2399.0
>
> -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
> The Requirements expression for job 2399.000 is
>
> TARGET.ONLY_LHCB
>
> Job 2399.000 defines the following attributes:
>
>
> The Requirements expression for job 2399.000 reduces to these
> conditions:
>
> Slots
> Step Matched Condition
> ----- -------- ---------
> [0] 106 TARGET.ONLY_LHCB
>
> No successful match recorded.
> Last failed match: Mon Nov 23 08:56:10 2020
>
> Reason for last match failure: no match found
>
> 2399.000: Run analysis summary ignoring user priority. Of
> 15795 machines,
> 15126 are rejected by your job's requirements
> 104 reject your job because of their own requirements
> 565 are exhausted partitionable slots
> 0 match and are already running your jobs
> 0 match but are serving other users
> 0 are available to run your job
>
> WARNING: Be advised:
> Job did not match any machines's constraints
> To see why, pick a machine that you think should match and add
> -reverse -machine <name>
> to your query.
>
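
The run analysis summary in [4] can be tallied mechanically to see which bucket the machines fall into. This is a minimal sketch that only parses the text exactly as pasted in this thread; parse_summary is a hypothetical helper, not part of any HTCondor tool, and the regular expressions assume this particular output layout.

```python
import re

# Summary text as quoted in this thread (email line wrapping kept).
SUMMARY = """\
2399.000: Run analysis summary ignoring user priority. Of
15795 machines,
15126 are rejected by your job's requirements
104 reject your job because of their own requirements
565 are exhausted partitionable slots
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
"""

def parse_summary(text):
    """Return (total_machines, {reason: count}) for one analysis summary."""
    total = int(re.search(r"Of\s+(\d+)\s+machines", text).group(1))
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        # Skip the wrapped "15795 machines," line: it is the total, not a reason.
        if m and not m.group(2).startswith("machines"):
            counts[m.group(2)] = int(m.group(1))
    return total, counts

total, counts = parse_summary(SUMMARY)
# Print the buckets from largest to smallest.
for reason, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:6d}  {reason}")
print("accounted for:", sum(counts.values()), "of", total)
```

For the numbers in this thread, the three non-zero buckets add up to the full 15795 machines, which leaves nothing in the "serving other users" or "available" lines.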
>
> ======================================================================
>
> [5]
>
> [root@ce-test ~]# condor_q -analyze -reverse -machine machine01 2399.0
>
>
> -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
> 2399.0: Analyzing matches for 1 job
>                     Slot  Slot's Req   Job's Req     Both
> Name                Type  Matches Job  Matches Slot  Match %
> ------------------  ----  -----------  ------------  -------
> slot1@machine01     Part  0            1             0.00
> slot1_10@machine01  Dyn   0            1             0.00
> slot1_11@machine01  Dyn   0            1             0.00
> slot1_12@machine01  Dyn   0            1             0.00
> slot1_14@machine01  Dyn   0            1             0.00
> slot1_15@machine01  Dyn   0            1             0.00
> slot1_16@machine01  Dyn   0            1             0.00
> slot1_18@machine01  Dyn   0            1             0.00
> slot1_19@machine01  Dyn   0            1             0.00
> slot1_1@machine01   Dyn   0            1             0.00
> slot1_20@machine01  Dyn   0            1             0.00
> slot1_21@machine01  Dyn   0            1             0.00
> slot1_22@machine01  Dyn   0            1             0.00
> slot1_24@machine01  Dyn   0            1             0.00
> slot1_26@machine01  Dyn   0            1             0.00
> slot1_29@machine01  Dyn   0            1             0.00
> slot1_2@machine01   Dyn   0            1             0.00
> slot1_30@machine01  Dyn   0            1             0.00
> slot1_32@machine01  Dyn   0            1             0.00
> slot1_3@machine01   Dyn   0            1             0.00
> slot1_5@machine01   Dyn   0            1             0.00
> slot1_9@machine01   Dyn   0            1             0.00
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/