Re: [HTCondor-devel] cpu affinity and partitionable slots


Date: Mon, 12 Nov 2012 13:16:58 -0600
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] cpu affinity and partitionable slots
On Nov 12, 2012, at 9:11 AM, Tim St Clair <tstclair@xxxxxxxxxx> wrote:

> of note: current cpu_shares (which only exists on master) uses SlotWeight, where I think it should really be TotalSlotCpus.
> 

Matt called me out on the above also; SlotWeight is probably more flexible, but possibly overloads the meaning of SlotWeight.  I'm a bit ambivalent and would be fine with changing.

> Open Questions:
> Does anyone have a good way of *really* testing cpu_shares? 
> What does over-subscription and fractions actually mean, or do we want to stick with whole numbers?  
> ++How does this^ affect performance?  
> 

I have no good way to *really* test them out (whatever that means), but we've used cpu_shares for the last 6 months and have anecdotal evidence.

cpu_shares has to be an integer value.  The absolute number doesn't seem to matter; only the shares relative to sibling cgroups do.  For example, /condor/ has cpu_shares=1024 (the default), but /condor/<job ID> has cpu_shares=100 for each job.
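
To illustrate the "relative shares among siblings" point, here is a minimal Python sketch.  It is not HTCondor or kernel code; the function name relative_fractions and the job names are hypothetical, and it assumes every sibling is fully busy, so each one's slice is simply its shares divided by the sum over its siblings.

```python
def relative_fractions(sibling_shares):
    """Return the fraction of contended CPU time each sibling cgroup gets.

    sibling_shares maps a cgroup name to its integer cpu_shares value.
    Assumption: all siblings are runnable all the time (full contention).
    """
    for name, shares in sibling_shares.items():
        # cpu_shares must be a positive integer; reject fractions outright.
        if not isinstance(shares, int) or isinstance(shares, bool) or shares <= 0:
            raise ValueError(f"cpu_shares for {name!r} must be a positive integer")
    total = sum(sibling_shares.values())
    return {name: shares / total for name, shares in sibling_shares.items()}


if __name__ == "__main__":
    # Three hypothetical jobs under /condor/, each with cpu_shares=100.
    print(relative_fractions({"job_a": 100, "job_b": 100, "job_c": 100}))
    # Same ratios with different absolute numbers give the same split.
    print(relative_fractions({"job_a": 1, "job_b": 1, "job_c": 1}))
```

Under this model, setting every job to 100 or every job to 1 gives the same split, which is why only the ratio between siblings matters.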

We've had a few cases where someone would send a multicore job but only request 1 CPU.  In this case, we've verified:
1) If the system is busy, the multicore job gets only 1 core worth of CPU time (the amount allocated).
2) If cycles would otherwise go unused, the multicore job gets those.
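
The two observations above match what you would expect from a work-conserving proportional-share scheduler.  The following Python sketch is an idealized model only, not the actual kernel scheduler: the function allocate, the job names, and the 8-core machine are all assumptions for illustration.  Each job has a shares value and a demand (how many cores it would use if unconstrained); cores are split in proportion to shares, and anything a job cannot use is redistributed to the others.

```python
def allocate(total_cores, jobs):
    """Idealized work-conserving proportional-share allocation.

    jobs maps a name to (shares, demand), where demand is the number of
    cores the job would use if nothing else were running.
    Returns a dict mapping each name to the cores it receives.
    """
    alloc = {name: 0.0 for name in jobs}
    active = set(jobs)
    remaining = float(total_cores)
    while active and remaining > 1e-9:
        total_shares = sum(jobs[n][0] for n in active)
        # Jobs whose demand fits within their proportional slice are capped
        # at their demand; the leftover is re-split among the rest.
        satisfied = {
            n for n in active
            if jobs[n][1] <= remaining * jobs[n][0] / total_shares + 1e-12
        }
        if not satisfied:
            # Everyone wants more than their slice: split purely by shares.
            for n in active:
                alloc[n] = remaining * jobs[n][0] / total_shares
            break
        for n in satisfied:
            alloc[n] = float(jobs[n][1])
            remaining -= jobs[n][1]
        active -= satisfied
    return alloc


if __name__ == "__main__":
    # Busy 8-core machine: one job running 8 threads plus 7 single-core jobs,
    # all with cpu_shares=100 (each requested 1 CPU).
    busy = {"multicore": (100, 8)}
    busy.update({f"single{i}": (100, 1) for i in range(7)})
    print(allocate(8, busy)["multicore"])

    # Mostly idle machine: the same multicore job plus only 3 single-core jobs.
    idle = {"multicore": (100, 8)}
    idle.update({f"single{i}": (100, 1) for i in range(3)})
    print(allocate(8, idle)["multicore"])
```

In the busy case the multicore job is held to one core's worth of time; in the mostly idle case it picks up the five cores nobody else is using.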

So, it works as described.  That's the good news.

The bad news is that we have seen CPU-scheduler-related kernel panics on RHEL 6.3.  They are quite rare (maybe one a week?), but I think they're cgroups-related.

Brian
