
Re: [Condor-users] Dynamic Slots & Parallel Universe



I should note that this is an inherent problem with any deterministic model for assignment.

Finding an optimal assignment is NP-complete (it's bin packing), and it's even worse given that the optimal solution may change on each and every state change or newly submitted job [1].

The only way to achieve 'fair' throughput in such a world is to either lose throughput by kicking other jobs to make room, or lose throughput by reserving slots that might otherwise have been used during the time it takes for the remaining slots to become free.
Writing a decent general-case solution to this would be a very good PhD thesis.

How we've dealt with similar issues is by partitioning the pool so that such jobs have dedicated resources which favour (e.g. via Machine RANK) jobs marked as needing them. Other jobs may choose to run on these machines, but those jobs are 'speculative': they accept that they stand a chance of being killed without warning. A rough sketch of that setup follows.
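
Something along these lines, purely as a sketch (the job attribute
IsDedicatedProjectJob is our own naming convention, not a built-in
Condor attribute). On the dedicated machines' startd config:

    # Favour jobs flagged as needing this resource; machine RANK
    # preemption will then evict lower-ranked ('speculative') jobs to
    # make room for them.
    RANK = 1000 * (TARGET.IsDedicatedProjectJob =?= True)

    # Let anything start here, so speculative jobs can soak up idle
    # capacity between dedicated jobs.
    START = True

and in the submit file of a job that needs the dedicated resources:

    +IsDedicatedProjectJob = True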

This works well for us because our jobs tend towards a 'two hump' model, with many short jobs which can 'fill in the cracks'; their users appreciate the additional potential throughput and accept the proviso that they may be evicted.

This is still not ideal: it would be nice, for example, to be able to predict when certain jobs in the 'short' category are actually 'medium' and avoid scheduling them onto such machines where possible. Such fine-grained control is possible with Condor (see the sketch below); it's the estimation itself that is hard.
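
For illustration only (EstimatedRunMins is a made-up attribute we would
set ourselves at submit time, and producing a good estimate is exactly
the hard part), the submit file would carry:

    # Our own guess at the job's run time, in minutes (not a standard
    # Condor attribute).
    +EstimatedRunMins = 20

and the dedicated machines would decline to back-fill with anything
that admits to being longer than 'short':

    # Extend whatever START already says (assumes START is defined
    # earlier in the config). Dedicated jobs are always welcome;
    # back-fill jobs with no estimate are treated as short here,
    # which may be too permissive.
    START = ($(START)) && ( (TARGET.IsDedicatedProjectJob =?= True) || \
                            (TARGET.EstimatedRunMins =?= UNDEFINED) || \
                            (TARGET.EstimatedRunMins <= 30) )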

Short answer: don't expect this to get fixed anytime soon. Attempt to exploit your local, domain-specific criteria/model to work around it.

Matt

[1] It gets even worse when you consider a heterogeneous pool and dependencies on access to data.

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Joan J. Piles
Sent: 31 August 2010 16:30
To: Condor-Users Mail List
Cc: eje@xxxxxxxxxx
Subject: Re: [Condor-users] Dynamic Slots & Parallel Universe

Hi all:

We have a very similar problem here, only more general. We not only 
have parallel jobs that may request a full machine (or several), but 
also the following situation:

* Our cluster is at full usage with 1-cpu jobs (this means that with 
the dynamic partitioning, each slot has only one cpu).
* A user with high priority submits an n-cpu job, with n > 1. There is 
no slot with cpus > 1, so no slot is even considered for eviction, and 
no slot can be assigned to the job, because the scheduler currently 
doesn't do "multiple slot evictions" in order to free up resources for 
a higher-priority, more resource-hungry job.
* Other lower-priority, lower-requirements jobs keep getting queued and 
fill the slots as soon as they become free.

The high-priority, highly demanding job is thus starved and will never 
be able to run as long as the cluster stays in a high-usage state.
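
(For context, the "dynamic partitioning" referred to above is the usual 
partitionable-slot setup, roughly:

    # One partitionable slot owning the whole machine; dynamic slots
    # are carved out of it to match each job's request_cpus etc.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = True

so an n-cpu job can only match a partitionable slot that still has at 
least n free cpus, never an already-carved 1-cpu dynamic slot.)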

The parallel universe is then just one instance of this more general 
problem.

Are there any plans to address this inherent limitation of the dynamic 
slot model?

Thanks in advance.

Joan

On 31/08/10 17:04, David J. Herzfeld wrote:
> Hi Erik:
>
> Thanks for the response. From the remarks in the ticket, this looks to
> be exactly what we want for question #3! Is there any estimate of when
> this will get incorporated into the stable release?
>
> This is exciting.
>
> David
>
> On 08/31/2010 09:42 AM, Erik Erlandson wrote:
>    
>> Regarding dynamic slots and parallel universe: the dedicated scheduler
>> (used by PU jobs) does not currently handle dynamic slots correctly. A
>> patch to correct this has been submitted and is pending review:
>>
>> https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=986,0
>>
>>
>> -Erik
>>
>>
>>
>> On Tue, 2010-08-31 at 08:56 -0500, David J. Herzfeld wrote:
>>      
>>> Hi All:
>>>
>>> We are currently working with a 1024-core cluster (8 cores per
>>> machine) using a pretty standard Condor config. Each core shows up as a
>>> single slot, etc.
>>>
>>> Users are starting to use multi-process jobs on the cluster - leading to
>>> over-scheduling. One way to combat this problem is the "whole machine"
>>> configuration presented on the Wiki at
>>> <https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots>.
>>> However, most of our users don't require the full machine (combinations
>>> of 2, 3, 4, 5.. cores). We could modify this config to supply slots for
>>> half a machine, etc.
>>>
>>> So a couple of questions:
>>> 1) Does this seem like a job for dynamic slots? Or should we modify the
>>> "whole machine" config?
>>>
>>> 2) If dynamic slots are the way to go, have they been shown to be stable
>>> in production environments?
>>>
>>> 3) Can we combine the dynamic slot allocations with the Parallel
>>> Universe to provide PBS-like allocations? Something like
>>> machine_count = 4
>>> request_cpus = 8
>>>
>>> to match 4 machines with 8 CPUs apiece? Similar to
>>> #PBS -l nodes=4:ppn=8
>>>
>>> As always - thanks a lot!
>>> David


-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Systems Analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 976 76 10 00 (ext. 5454)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
