Re: [Gems-users] Issues with collecting memory access trace for logtm microbenchmark (tm-deque)


Date: Fri, 12 Jan 2007 17:48:29 -0500
From: Shougata Ghosh <shougata@xxxxxxxxxxxxx>
Subject: Re: [Gems-users] Issues with collecting memory access trace for logtm microbenchmark (tm-deque)
Hi Dan,
Thanks for your quick reply.
I run the simulations with "ruby0.setparam_str REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH true".
Doesn't that mean FAST_PATH is disabled?
So when FAST_PATH is enabled, is there no way to tell from ruby_operate() whether a request is a duplicate? It seems that mh_memorytracer_possible_cache_miss() will return 0 both for duplicate requests and for L1 cache hits.
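For reference, here's roughly what my filter looks like (a sketch only --
mh_memorytracer_possible_cache_miss() is the existing GEMS hook;
record_trace_entry() is just a name for my own logging helper):

    /* inside ruby_operate() in ruby.c */
    cycles_t stall = mh_memorytracer_possible_cache_miss(mem_op);
    if (stall != 0) {
        record_trace_entry(mem_op);  /* only log requests Ruby stalls on */
    }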
Thanks
shougata


>The return value may be zero for normal hits in the L1 cache when FAST_PATH is enabled... that could also explain some of your other issues.
>
>Shougata Ghosh wrote:
>> Thanks Jayaram and Dan for your replies. I was aware that simics sends
>> memory requests to ruby more than once. What I am doing inside
>> ruby_operate() is that I only record the transaction in my trace file if
>> the return value of mh_memorytracer_possible_cache_miss(mem_op) is
>> non-zero. Does that sound ok?
>> Creating a processor set with pset_create and then binding the threads
>> to the cpus of this set kept all the other processes from
>> interfering with my benchmark.
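>> In case it helps anyone, the binding boils down to something like this
>> (a sketch -- error handling omitted, and cpu_id is whatever processor
>> you want in the set):
>>
>>     #include <sys/pset.h>
>>     #include <sys/types.h>
>>     #include <unistd.h>
>>
>>     psetid_t pset;
>>     pset_create(&pset);                     /* new empty processor set */
>>     pset_assign(pset, cpu_id, NULL);        /* move a cpu into the set */
>>     pset_bind(pset, P_PID, getpid(), NULL); /* bind this process to it */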
>> Thanks again
>> shougata
>>
>>> From: Dan Gibson <degibson@xxxxxxxx>
>>> Subject: Re: [Gems-users] Issues with collecting memory access trace
>>>     for logtm microbenchmark (tm-deque)
>>> To: Gems Users <gems-users@xxxxxxxxxxx>
>>> Message-ID: <45A6459B.3030301@xxxxxxxx>
>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>
>>> Shougata,
>>> Let me take a stab at the areas that I am confident in answering
>>> correctly (eg. NOT LogTM):
>>>
>>> **Too many requests:
>>> Simics uses an "are you sure?" policy when issuing memory requests to
>>> Ruby. That is, each request is passed to ruby *twice* -- once to
>>> determine the stall time, and once when the stall time has elapsed and
>>> Simics is verifying that Ruby wants the operation to complete. These
>>> dual requests are handled in SimicsProcessor.C -- for convenience (both
>>> of C++ language and for the filtering effect) you may want to move your
>>> trace generation higher into Ruby's hierarchy (say, SimicsProcessor.C or
>>> Sequencer.C).
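>>> If you do stay in ruby_operate(), one illustrative way to drop the
>>> second pass is to remember what you have already logged (sketch only;
>>> cpu_num, phys_addr and record_trace_entry() are placeholders for your
>>> own code, and you'd need <set> and <utility>):
>>>
>>>     static std::set< std::pair<int, uint64> > outstanding;
>>>     std::pair<int, uint64> key(cpu_num, phys_addr);
>>>     if (outstanding.erase(key)) {
>>>         /* second, "are you sure?" pass -- skip it */
>>>     } else {
>>>         outstanding.insert(key);        /* first pass -- log it */
>>>         record_trace_entry(mem_op);
>>>     }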
>>>
>>> **ASI:
>>> Contrary to the name, the ASI is used to specify not the process but the
>>> target address space. 128 is the vanilla address space for main
>>> memory... ASIs are detailed quite extensively in Sun's microprocessor
>>> manuals (google for "sun ultrasparc manual" and see the section on
>>> ASIs). For example, ASI 0x80 (aka decimal 128) is ASI_PRIMARY, the
>>> normal address space for user-level accesses to main memory; ASI 0x58
>>> is for
>>> accesses to the data TLB, 0x59 is the data TSB, etc. There is not one
>>> ASI per process.
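>>> Illustratively, if you only want ordinary data accesses in your trace,
>>> you could filter on the ASI field you are already casting for (a
>>> sketch, not tested):
>>>
>>>     v9_memory_transaction_t *v9 = (v9_memory_transaction_t *) mem_op;
>>>     if (v9->asi != 0x80)    /* 0x80 == ASI_PRIMARY, normal user space */
>>>         return;             /* skip MMU/TSB/diagnostic accesses */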
>>>
>>> Regards,
>>> Dan
>>>
>>> Shougata Ghosh wrote:
>>>
>>>> Hi
>>>> I am simulating 16 processor ultrasparc-iii with solaris10. I loaded
>>>> ruby (no opal) with simics. The protocol I used was
>>>> MESI_SMP_LogTm_directory and I was running tm-deque microbenchmark that
>>>> comes with GEMS. My goal was to collect the memory traces (only data
>>>> access, no instruction access) of tm-deque and analyse the trace file
>>>> offline.
>>>> Let me first give a brief overview of how I collect the traces.
>>>>
>>>> I print the clock_cycle (simics cycle), the cpu making the request, the
>>>> physical address of the memory location, the type of access (r or w) and
>>>> if this cpu is currently executing a xaction (logTm). The format looks
>>>> like this:
>>>>
>>>> cycle    cpu    phys_addr    type(r/w)    in_xaction
>>>>
>>>> This I print from inside ruby_operate() in ruby.c, since this function
>>>> is called for every memory access simics makes.
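>>>> The print itself is roughly the following (the variable names are
>>>> mine; SIM_cycle_count() is the Simics call):
>>>>
>>>>     fprintf(trace_fp, "%lld\t%d\t%lld\t%c\t%d\n",
>>>>             (long long) SIM_cycle_count(cpu), cpu_num,
>>>>             (long long) phys_addr,
>>>>             is_write ? 'w' : 'r', in_xaction);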
>>>> In addition to this, in a different trace file, I print when a xaction
>>>> begins, commits or aborts. This I print from
>>>> magic_instruction_callback() in commands.C. The format is following:
>>>>
>>>> cycle    cpu    xaction_type(B/C/A)    xaction_id(for nested xaction)
>>>>
>>>> Once the simulation is completed, I combine the two trace files and sort
>>>> it with the clock cycle field.
>>>>
>>>> *****The biggest issue is having too many requests. I want to filter
>>>> out memory requests from every process except tm-deque.
>>>> Right now, I'm excluding kernel requests by inspecting the priv
>>>> field in (v9_memory_transaction *) mem_op->priv. If the priv field is 1,
>>>> I don't record that transaction. I believe this effectively keeps the
>>>> kernel requests out of my trace. But there are other maintenance/service
>>>> processes started by the kernel, running in user space, which access
>>>> memory, and I want to filter them out too. I have tried to detect the
>>>> pid or some sort of process id from inside ruby but haven't had any
>>>> success so far! Things I have looked into are:
>>>>
>>>> - The ASID (address space id) field in (v9_memory_transaction *)
>>>> mem_op->asi. This didn't work!! The ASID was a fixed 128 throughout. One
>>>> possible reason is that perhaps the ASID changes between user space and
>>>> kernel space. Since I'm only recording user-space accesses, I don't see
>>>> any changes in ASID.
>>>>
>>>> - The content of global register g7. From inspecting the opensolaris
>>>> code, I noticed that the getpid() function gets the address of the
>>>> current_thread structure from %g7. It then gets a pointer to the process
>>>> the current_thread belongs to from the current_thread structure. Next,
>>>> it reads the process_id from the process structure. Since I don't care
>>>> about the exact pid, I inspected the value of the %g7 register. I didn't
>>>> see any changes in that! One possibility, of course, is that %g7 stores a
>>>> virtual address which could be the same for all processes. If all the
>>>> processes are running just one thread, this seemed very likely. So, next
>>>> I looked into the corresponding physical address. Unfortunately, that
>>>> remained constant as well!
>>>> I'll try reading the content of the memory location pointed to by the
>>>> physical address (thread_phys_addr). Maybe that will have a different
>>>> value! I have yet to look into that.
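>>>> For completeness, the checks above look roughly like this (the SIM_*
>>>> calls are the standard Simics API; the rest is paraphrased from my
>>>> code):
>>>>
>>>>     v9_memory_transaction_t *v9 = (v9_memory_transaction_t *) mem_op;
>>>>     if (v9->priv)
>>>>         return;                        /* kernel-mode access: skip */
>>>>     conf_object_t *cpu = SIM_current_processor();
>>>>     int g7 = SIM_get_register_number(cpu, "g7");
>>>>     uinteger_t thread_va = SIM_read_register(cpu, g7);
>>>>     physical_address_t thread_pa =
>>>>         SIM_logical_to_physical(cpu, Sim_DI_Data, thread_va);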
>>>>
>>>> On a side note, how does LogTm differentiate xactional requests from
>>>> non-xactional ones if they both come from the same processor?
>>>>
>>>> *****My second issue is with the clock cycle I print for timestamping. I
>>>> am using SIM_cycle_count() to timestamp the memory accesses. When I
>>>> combine the two traces, I notice that after a xaction has begun,
>>>> subsequent memory accesses printed from ruby_operate() don't have
>>>> in_xaction set to 1! Here's an example:
>>>> 9067854    13    189086172    r    0
>>>> 9067856    13    185775464    w    0
>>>> 9068573    13    B    0            <- xaction begins
>>>> 9069382    13    185775464    w    0
>>>> 9069387    13    185775468    r    0
>>>> .
>>>> .
>>>> .
>>>> 9069558    13    185775468    w    0
>>>> 9069566    13    185775468    w    0
>>>> 9069611    13    185775272    r    1        <- first time in_xaction turns 1
>>>>
>>>> There's always a lag of about 1000 cycles between xaction Begin and
>>>> in_xaction turning into 1 in the memory access traces. I did make sure I
>>>> set the cpu-switch-cycle to 1 in simics before I started my simulations!
>>>> I get the value of in_xaction in the following way:
>>>> #define XACT_MGR \
>>>>   g_system_ptr->getChip(SIMICS_current_processor_number() /     \
>>>>                         RubyConfig::numberOfProcsPerChip())     \
>>>>     ->getTransactionManager(SIMICS_current_processor_number() % \
>>>>                             RubyConfig::numberOfProcsPerChip())
>>>> in_xaction = XACT_MGR->inTransaction();
>>>>
>>>> As I mentioned earlier, I get the clock_cycle from SIM_cycle_count(cpu).
>>>> Any idea what could be causing this? Do you think I should try using
>>>> ruby_cycles instead?
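>>>> One experiment I may try is logging both clocks side by side to see
>>>> where the skew comes from (I believe g_eventQueue_ptr->getTime() is
>>>> Ruby's own cycle counter):
>>>>
>>>>     fprintf(trace_fp, "simics=%lld ruby=%lld\n",
>>>>             (long long) SIM_cycle_count(cpu),
>>>>             (long long) g_eventQueue_ptr->getTime());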
>>>>
>>>> *****My third issue is specific to the LogTm microbenchmark I was
>>>> running, tm-deque. I ran it with 10 threads and set the # of ops to 10.
>>>> Initially I wanted small xactions without conflicts. When I look at the
>>>> trace file, I don't see any interleaving threads. The 10 threads ran one
>>>> after the other in the following order:
>>>> thread        cpu    start_cycle
>>>> T1        13    9068573
>>>> T2        9    10035999
>>>> T3        13    10944933
>>>> T4        2    11654399
>>>> T5        9    11781161
>>>> T6        13    11886113
>>>> T7        4    16280785
>>>> T8        13    16495097
>>>> T9        0    16917327
>>>> T10        6    17562721
>>>>
>>>> Why aren't the threads running in parallel? The code dispatches all 10
>>>> threads in a for-loop and later does a thread_join. I am simulating 16
>>>> processors - I expected all 10 threads to run in parallel! Also, the
>>>> number of clock cycles between the end of one thread and the start of
>>>> the next one is quite large - it varied from 200,000 to 900,000!
>>>> Am I doing something wrong with the way I am collecting the clock_cycle
>>>> with SIM_cycle_count(current_cpu)?
>>>>
>>>> I would really appreciate it if anyone could share their thoughts/ideas on
>>>> these issues.
>>>> Thanks a lot in advance.
>>>> -shougata
>>>>
>>>> _______________________________________________
>>>> Gems-users mailing list
>>>> Gems-users@xxxxxxxxxxx
>>>> https://lists.cs.wisc.edu/mailman/listinfo/gems-users
>>>> Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.
>>>>
>>>>
>>>>
>>>>
>> _______________________________________________
>> Gems-users mailing list
>> Gems-users@xxxxxxxxxxx
>> https://lists.cs.wisc.edu/mailman/listinfo/gems-users
>> Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.
>>
>>
>>