Re: [Condor-devel] RFC: Parallel classad evaluations
- Date: Tue, 10 Apr 2012 07:51:25 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] RFC: Parallel classad evaluations
On Apr 10, 2012, at 6:27 AM, Matthew Farrellee wrote:
> On 04/08/2012 12:58 PM, Ziliang Guo wrote:
>> Inline
>>
>> On Sun, Apr 8, 2012 at 7:10 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>>>
>>> inline
>>>
>>>
>>> On 04/07/2012 09:15 PM, Ziliang Guo wrote:
>>>
>>>> As some of you may be aware, I'm working on parallelizing classad
>>>> matchmaking for my ME964 project. Right now I have an implementation
>>>> that takes in a single job classad and multiple machine ads and spins
>>>> up multiple threads to search for a match amongst the machine ads.
>>>> Another possibility is to have multiple threads each trying to find
>>>> matches for a different job, though that one is a bit more complicated
>>>> to pull off thanks to us modifying the machine ads during the
>>>> matchmaking process (old classad compatibility hacks ftw....). Does
>>>> anyone else have any suggestions for other variations that may be
>>>> useful or comments on the above two methods (keeping in mind only the
>>>> single job vs multiple machine match has been implemented and tested
>>>> so far)?
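>>>>
>>>> (Rough sketch of the shape of it, with made-up names rather than the
>>>> actual negotiator code, and ignoring for a second the ad-mutation issue
>>>> from the compatibility hacks:)
>>>>
>>>>     #include <atomic>
>>>>     #include <thread>
>>>>     #include <vector>
>>>>
>>>>     // One job ad vs. many machine ads: each worker strides over its own
>>>>     // share of the machine list and records the first match it finds.
>>>>     ClassAd *FindMatchParallel(ClassAd &job,
>>>>                                std::vector<ClassAd*> &machines,
>>>>                                unsigned nThreads)
>>>>     {
>>>>         std::atomic<ClassAd*> found(nullptr);
>>>>         std::vector<std::thread> workers;
>>>>         for (unsigned t = 0; t < nThreads; ++t) {
>>>>             workers.emplace_back([&, t]() {
>>>>                 // interleaved slices: thread t handles indices t, t+N, t+2N, ...
>>>>                 for (size_t i = t; i < machines.size() && !found.load(); i += nThreads) {
>>>>                     if (IsAMatch(&job, machines[i]))   // the usual two-way Requirements check
>>>>                         found.store(machines[i]);
>>>>                 }
>>>>             });
>>>>         }
>>>>         for (auto &w : workers) w.join();
>>>>         return found.load();
>>>>     }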
>>>
>>>
>>> i can think of two high level objectives here: 0) show that the
>>> algorithm can be parallelized; 1) make matchmaking faster. are you
>>> aiming for one of those or something else?
>>>
>>> i'd aim for making matchmaking faster.
>>>
>>> let me know what your objective is and i'll comment more.
>>>
>>
>> A little bit of both, mostly to make things faster. I haven't dug
>> deeply enough into the actual evaluation code yet to determine if the
>> individual evaluations could be done in parallel, but my gut reaction
>> is that unless we can run something like a thousand of them at the same
>> time, sequential evaluation is fast enough that it's not worth it.
>>
>>> -
>>> an over-simplification of the algorithm is,
>>>
>>> 0: for j in jobs                              (a)
>>> 1:     for m in machines                      (b)
>>> 2:         if j accepts m and m accepts j
>>> 3:             add m to j's matches
>>>
>>> current implementation parallelizes (b)?
>>> parallelization of (a) blocked by mutating classads?
>>>
>>> have you considered possibilities if you could swap (0) and (1)?
>>>
>>
>> Yes, and it's complicated by the changes we make to the two classads we
>> try to match. Since we set the (I think it was the alternate) member of
>> each classad to point at the other, we can't match the same instance of a
>> classad against multiple other instances at the same time. That's why, in
>> the current implementation, I basically have to keep four copies of the
>> job classad in order to run four matchmaking operations at the same time.
>> If we flip 0 and 1, we'd either need to make copies of the machine ad for
>> every candidate, or stagger the evaluations and force each set of
>> evaluations to finish before moving on to the next set. At least with the
>> current order of operations, I can create the copies of the job ad before
>> I spawn the threads and avoid doing any memory allocations or clones
>> inside the threads.
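>>
>> In code terms it's roughly this (made-up names again, and assuming the ad
>> type is copyable - not the actual implementation):
>>
>>     #include <functional>
>>     #include <thread>
>>     #include <vector>
>>
>>     // Hypothetical worker: scans its stride of the machine list with its
>>     // own private job-ad copy (body omitted).
>>     void ScanSliceForMatch(ClassAd &jobCopy, const std::vector<ClassAd*> &machines,
>>                            unsigned t, unsigned nThreads);
>>
>>     // Every allocation (the per-thread job-ad copies) happens here, before
>>     // any thread is spawned; the workers only touch their own copy.
>>     void RunMatchWorkers(const ClassAd &jobAd,
>>                          const std::vector<ClassAd*> &machines,
>>                          unsigned nThreads)
>>     {
>>         std::vector<ClassAd> jobCopies(nThreads, jobAd);  // one copy per worker
>>         std::vector<std::thread> workers;
>>         for (unsigned t = 0; t < nThreads; ++t)
>>             workers.emplace_back(ScanSliceForMatch,
>>                                  std::ref(jobCopies[t]),
>>                                  std::cref(machines), t, nThreads);
>>         for (auto &w : workers) w.join();
>>     }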
>>
>>> -
>>>
>>>
>>>> Long term, there is also a question of whether there is any point in
>>>> trying to parallelize the parsing of a classad from the raw text into
>>>> its expression tree, or whether this is even possible. Really long
>>>
>>>
>>> i believe we can flyweight (save on memory and re-parsing)
>>> expressions as a result of the work tim did on classads. if the
>>> objective is optimization, parallelizing the parse isn't going to bear
>>> fruit.
>>>
>>>
>>>
>>>> term, would it be possible to generate data structures such that we
>>>> could attempt a matchmaking operation on a GPU?
>>>
>>>
>>> that would be interesting.
>
>
> since you're going for matchmaking performance, you really need some base measurements. they'll truly guide where you go. if you find that threading (b) only helps after you hit 100K machines, stop. classad evaluation isn't going to be the first thing to optimize when you have that many machines - for instance, phase 1 of the negotiation cycle copies all of the collector memory over to the negotiator via a tcp socket.
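>
> a quick way to get those base numbers is to just wall-clock the existing
> sequential pass before touching anything (a sketch with made-up names, not
> the real negotiator loop):
>
>     #include <chrono>
>     #include <cstdio>
>     #include <vector>
>
>     // time one job ad against the whole machine list, sequentially
>     void TimeSequentialPass(ClassAd &job, std::vector<ClassAd*> &machines)
>     {
>         auto t0 = std::chrono::steady_clock::now();
>         int matches = 0;
>         for (ClassAd *m : machines)
>             if (IsAMatch(&job, m)) ++matches;
>         auto t1 = std::chrono::steady_clock::now();
>         printf("%d matches over %zu machines in %.3f ms\n",
>                matches, machines.size(),
>                std::chrono::duration<double, std::milli>(t1 - t0).count());
>     }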
>
> on the comment front, you should look into how much memory churn exists during the standard algorithm and during the parallel algorithm. any parallelization in the negotiator that adds memory load will scale poorly and has a good chance of decreasing performance.
I happened to be poking around with a memory profiler yesterday. It seems that memory allocation is one of the bottlenecks in the system.
I would seriously consider trying a pool-based allocator for classads - especially the lexer. That gives you constant-time allocation. My understanding is that GCC switched to obstacks / pool-based allocators for some of its time-critical pieces.
It's far less sexy than parallelization, but potentially has a better payoff.
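For concreteness, here's the sort of thing I mean - a minimal fixed-size free-list pool, purely illustrative and not the actual classad lexer code:

    #include <cstddef>
    #include <vector>

    // Hands out fixed-size blocks in (amortized) constant time by popping a
    // free list; memory only goes back to the OS when the pool is destroyed.
    class BlockPool {
    public:
        explicit BlockPool(std::size_t blockSize, std::size_t blocksPerChunk = 1024)
            : m_blockSize(RoundUp(blockSize)), m_blocksPerChunk(blocksPerChunk),
              m_free(nullptr) {}

        ~BlockPool() { for (char *c : m_chunks) delete[] c; }

        void *allocate() {
            if (!m_free) grow();
            void *p = m_free;
            m_free = *static_cast<void**>(m_free);   // pop the free list
            return p;
        }

        void deallocate(void *p) {
            *static_cast<void**>(p) = m_free;        // push back onto the free list
            m_free = p;
        }

    private:
        static std::size_t RoundUp(std::size_t n) {
            const std::size_t a = sizeof(void*);
            if (n < a) n = a;                // need room for the free-list link
            return ((n + a - 1) / a) * a;    // keep blocks pointer-aligned
        }

        void grow() {
            char *chunk = new char[m_blockSize * m_blocksPerChunk];
            m_chunks.push_back(chunk);
            // thread every block in the new chunk onto the free list
            for (std::size_t i = 0; i < m_blocksPerChunk; ++i) {
                void *block = chunk + i * m_blockSize;
                *static_cast<void**>(block) = m_free;
                m_free = block;
            }
        }

        std::size_t m_blockSize;
        std::size_t m_blocksPerChunk;
        void *m_free;
        std::vector<char*> m_chunks;
    };

The lexer would then carve its tokens out of something like that instead of hitting malloc for every one.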
Finally - I highly suggest getting familiar with a profiler like igprof.
Brian