Re: [HTCondor-users] Quotas - accepting surplus but not too much surplus
- Date: Mon, 5 Aug 2013 10:53:35 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Quotas - accepting surplus but not too much surplus
On Aug 5, 2013, at 10:39 AM, Keith Chadwick <chadwick@xxxxxxxx> wrote:
> At Fermilab, we use quotas and we also wanted a mechanism to allow jobs to complete,
> yet implement preemption.
>
> So...
>
> We started by histogramming the job durations, and analyzed the histograms.
>
> The result for the ensemble of our workloads (pretty much independent of the
> individual workloads) was that job duration peaked between 4 and 6 hours, and
> there was an exponential falloff from the peak. More than 95% of jobs completed
> in less than 24 hours.
>
> The full analysis is available here:
>
> http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=3246
>
> Based on this analysis we set a preemption timeout of 24 hours.
>
Very interesting!
As a pro-tip, HTCondor 8.0 automatically calculates some similar histograms for you. From the output of "condor_status -l -schedd":
JobsRuntimesHistogramBuckets = "30Sec, 1Min, 3Min, 10Min, 30Min, 1Hr, 3Hr, 6Hr, 12Hr, 1Day, 2Day, 4Day, 8Day, 16Day"
JobsCompletedRuntimes = "49013, 12376, 10268, 5203, 73025, 15853, 27233, 32692, 34783, 23434, 10062, 8, 0, 0, 0"
Unfortunately, the histogram buckets are not sysadmin-customizable. Honestly, I haven't had much time to play with these locally. I suspect the uneven buckets would cause me heartache. It may also be useful to request the aggregate job runtime for each bucket instead of the job count.
(there's a similar mechanism for job memory usage)
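For anyone who wants to turn those two attributes into the kind of "what fraction of jobs finish within N hours" figure Keith describes, here is a minimal Python sketch. It rests on assumptions that are not stated in this thread: each bucket label is treated as the upper bound of its bucket, and the extra fifteenth count is treated as an overflow bucket for jobs longer than 16 days. The helper names (parse_histogram, cumulative_fractions, smallest_bound_covering) are made up for illustration and are not part of HTCondor.

```python
from itertools import accumulate

# Attribute values copied verbatim from the condor_status output quoted above.
BUCKETS = ("30Sec, 1Min, 3Min, 10Min, 30Min, 1Hr, 3Hr, 6Hr, 12Hr, "
           "1Day, 2Day, 4Day, 8Day, 16Day")
COMPLETED = ("49013, 12376, 10268, 5203, 73025, 15853, 27233, 32692, "
             "34783, 23434, 10062, 8, 0, 0, 0")


def parse_histogram(buckets, counts):
    """Split the two attribute strings into parallel lists (hypothetical helper)."""
    labels = [b.strip() for b in buckets.split(",")]
    values = [int(c) for c in counts.split(",")]
    # Assumption: when there is one more count than labels, the last count
    # is an overflow bucket for everything above the final boundary.
    if len(values) == len(labels) + 1:
        labels.append(">" + labels[-1])
    elif len(values) != len(labels):
        raise ValueError("bucket and count lists do not line up")
    return labels, values


def cumulative_fractions(values):
    """Fraction of all jobs that fall in this bucket or any earlier one."""
    total = sum(values)
    if total == 0:
        return [0.0] * len(values)
    return [running / total for running in accumulate(values)]


def smallest_bound_covering(labels, values, target):
    """First bucket label whose cumulative fraction reaches `target`, or None."""
    for label, fraction in zip(labels, cumulative_fractions(values)):
        if fraction >= target:
            return label
    return None


if __name__ == "__main__":
    labels, values = parse_histogram(BUCKETS, COMPLETED)
    for label, fraction in zip(labels, cumulative_fractions(values)):
        print(f"{label:>7}  {fraction:6.1%}")
    print("95% of jobs finish within:",
          smallest_bound_covering(labels, values, 0.95))
```

Working on cumulative fractions sidesteps some of the pain of the uneven bucket widths, since only the boundaries matter. Under the assumptions above, the sample numbers put the 95% mark in the "1Day" bucket. This still counts jobs rather than aggregate runtime, so it says nothing about how much wall-clock time each bucket represents.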
> The result is that users get their "dedicated" slots (quotas actually) and can
> "opportunistically" use more than their quota. When sufficient quota'd users
> need slots, the opportunistic jobs are signaled that they should preempt with
> a preemption time of 24 hours. Since the above analysis shows that the typical
> job duration is less than 24 hours, the jobs get to complete, and the cluster
> reclaims the slot for the quota'd use.
>
This mechanism gets a little blurry with the use of pilot jobs; however, I know there are experiments which aim to come up with nicer preemption mechanisms for pilots.
Brian