Mailing List Archives
Re: [HTCondor-users] Minimize time to start job.
- Date: Wed, 14 Aug 2019 21:49:35 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Minimize time to start job.
On 8/13/2019 6:21 AM, don_vanchos wrote:
> Hello,
>
> I noticed that for the simplest vanilla jobs on my cluster, the time
> difference `job_ad["JobStartDate"] - job_ad["QDate"]` is from 5 to 20
> seconds. So quite a lot of time elapses between sending job to the queue
> and starting the process.
>
> Then I set NEGOTIATOR_CYCLE_DELAY setting to 0. And this time difference
> became equal from 0 to 1 second.
>
> My goal is to make job launch as fast as possible! What are the
> consequences if I make this setting equal to zero? Maybe performance
> degradation? Or maybe the wrong behavior in some cases? If the '0' value
> is harmful, then how can I minimize this time difference?
>
> P.S. Many thanks to all for the answers to my questions in the
> neighboring branches and in this (in advance).
>
> --
> Sincerely yours,
> Ivan Ergunov
Hi Ivan,
Here is how it works: the condor_schedd (running on your submit node)
maintains a set of execute node slots it has claimed. Starting an idle
job on a slot the schedd has already claimed is typically very fast
(sub-second). However, if the schedd does not have a claimed slot
available, it needs to ask the negotiator for a match; this is the 5 to
20 second delay you initially observed. So if you submit 1000 jobs to a
pool with 10 CPU cores, it will take a few seconds for the schedd to get
the initial matches and start the first 10 jobs, but jobs 11 through
1000 will start practically immediately when an earlier job completes
(because the schedd does not need to talk to the negotiator again; it
already has the slots claimed).
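To make the claimed-slot behavior concrete, here is a minimal toy model in Python. It is only a sketch, not HTCondor code: the function name simulate_start_times and the parameters match_delay (seconds to get the initial matches from the negotiator) and reuse_delay (seconds for the schedd to start the next job on an already-claimed slot) are hypothetical, and the values used are illustrative rather than measured.

```python
import heapq

def simulate_start_times(num_jobs, num_slots, run_time, match_delay, reuse_delay=0.0):
    """Toy model: return the start time of each job, all submitted at t=0."""
    if num_slots < 1:
        raise ValueError("need at least one slot")
    # Assumption: every slot becomes usable only after the negotiator has
    # matched it and the schedd has claimed it (match_delay seconds).
    slot_free_at = [match_delay] * min(num_slots, num_jobs)
    heapq.heapify(slot_free_at)
    starts = []
    for _ in range(num_jobs):
        # Take the slot that becomes free earliest.
        start = heapq.heappop(slot_free_at)
        starts.append(start)
        # The claim is kept, so the next job on this slot only pays
        # reuse_delay after the current job finishes, not match_delay.
        heapq.heappush(slot_free_at, start + run_time + reuse_delay)
    return starts

# 1000 jobs on 10 slots, as in the example above (timings are made up).
starts = simulate_start_times(num_jobs=1000, num_slots=10, run_time=60.0,
                              match_delay=10.0, reuse_delay=0.5)
print(starts[:10])  # the first 10 jobs all wait for the negotiator
print(starts[10])   # job 11 starts shortly after the first job completes
```

In this model only the first 10 jobs pay the negotiation delay; every later job starts reuse_delay seconds after a slot frees up.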
To answer your question above about NEGOTIATOR_CYCLE_DELAY: if your pool
is of modest size (e.g. about one thousand cores or fewer), is all
located on the same local-area network (i.e. your pool is not spread
across a high-latency wide-area network), and you have just one submit
node (i.e. one schedd) where you are submitting jobs, I think a
NEGOTIATOR_CYCLE_DELAY of 1 or 2 seconds would be fine. Because the
negotiator is mostly stateless, the idea of NEGOTIATOR_CYCLE_DELAY is to
give the schedd time to claim the slots matched by the negotiator, and
for those claims to be reflected in the collector, before the next
negotiator cycle starts, so that the negotiator does not waste time
giving out the same resources over and over again.
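The reasoning behind the delay can be illustrated with another small toy model. This is a sketch under loud assumptions, not a description of the negotiator's real internals: count_stale_cycles, cycle_duration, and claim_visible_after (seconds until the collector reflects a claim) are hypothetical names, and the numbers are invented. It simply counts how many negotiation cycles would begin while the collector still shows an already-matched slot as unclaimed.

```python
def count_stale_cycles(cycle_delay, claim_visible_after, cycle_duration=0.25):
    """Count cycles that start before the collector shows the claim."""
    # Assumption: cycles run back to back, separated only by cycle_delay.
    period = cycle_duration + cycle_delay
    if period <= 0:
        raise ValueError("cycle_duration + cycle_delay must be positive")
    stale = 0
    next_start = period  # cycle 0 starts at t=0 and makes the original match
    while next_start < claim_visible_after:
        # This cycle still sees the slot as unclaimed and may offer it again.
        stale += 1
        next_start += period
    return stale

# Compare a few delays, assuming a claim takes 1 second to show up.
for delay in (0, 1, 2):
    print(delay, count_stale_cycles(delay, claim_visible_after=1.0))
```

With these made-up numbers, a delay of 0 lets several cycles see stale slot state, while a delay of 1 or 2 seconds lets none, which is the trade-off described above.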
Note that the negotiator cycle itself is started periodically (controlled
by the config knob NEGOTIATOR_INTERVAL) or triggered whenever a
condor_submit command or a condor_reschedule command is issued.
Hope the above helps, feel free to ask any followup questions if the
above was unclear,
regards
Todd
--
Todd Tannenbaum University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257