Hi Dan
Thanks for your quick reply.
I run the simulations with "ruby0.setparam_str REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH true".
Doesn't that mean FAST_PATH is disabled?
So when FAST_PATH is enabled, there is no way to tell from ruby_operate()
whether a request is a duplicate or not? It seems that
mh_memorytracer_possible_cache_miss() will return 0 both for duplicate
requests and for L1 cache hits.
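
For reference, the filter I have in ruby_operate() boils down to roughly the
sketch below. It is only a sketch: record_trace_entry() is a hypothetical
helper standing in for my fprintf code, and I am assuming mem_op is the
transaction pointer ruby_operate() already receives.

    /* Only log requests for which Ruby reports a non-zero stall time.   */
    if (mh_memorytracer_possible_cache_miss(mem_op) != 0) {
        record_trace_entry(mem_op);   /* hypothetical fprintf helper     */
    }
    /* A zero return covers the duplicate second pass -- and, with the
       single-cycle dcache fast path enabled, plain L1 hits as well, so
       the two cases are indistinguishable at this point.                */
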
Thanks
shougata
>The return value may be zero for normal hits in the L1 cache when FAST_PATH
>is enabled... that could also explain some of your other issues.
>
>Shougata Ghosh wrote:
>> Thanks Jayaram and Dan for your replies. I was aware that Simics sends
>> memory requests to Ruby more than once. What I am doing inside
>> ruby_operate() is that I only record the transaction in my trace file if
>> the return value of mh_memorytracer_possible_cache_miss(mem_op) is
>> non-zero. Does that sound ok?
>> Creating a processor set with pset_create and then binding the threads
>> to the cpus of this set kept all the other processes from interfering
>> with my benchmark.
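>>
>> (For completeness, the binding is just the standard Solaris processor-set
>> calls -- a minimal sketch, assuming cpus 0-9 are online and binding the
>> whole process rather than individual LWPs:
>>
>>     #include <sys/types.h>
>>     #include <sys/processor.h>
>>     #include <sys/procset.h>
>>     #include <sys/pset.h>
>>     #include <unistd.h>
>>
>>     static void bind_to_private_pset(void)
>>     {
>>         psetid_t pset;
>>         pset_create(&pset);                     /* new, empty processor set */
>>         for (processorid_t cpu = 0; cpu < 10; cpu++)
>>             pset_assign(pset, cpu, NULL);       /* move cpus 0..9 into it   */
>>         pset_bind(pset, P_PID, getpid(), NULL); /* bind this process to it  */
>>     }
>>
>> plus error checking on each call.)
>>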
>> Thanks again
>> shougata
>>
>>
>>> From: Dan Gibson <degibson@xxxxxxxx>
>>> Subject: Re: [Gems-users] Issues with collecting memory access trace
>>> for logtm microbenchmark (tm-deque)
>>> To: Gems Users <gems-users@xxxxxxxxxxx>
>>>
>>> Shougata,
>>> Let me take a stab at the areas that I am confident in answering
>>> correctly (e.g. NOT LogTM):
>>>
>>> **Too many requests:
>>> Simics uses an "are you sure?" policy when issuing memory requests to
>>> Ruby. That is, each request is passed to Ruby *twice* -- once to
>>> determine the stall time, and once when the stall time has elapsed and
>>> Simics is verifying that Ruby wants the operation to complete. These
>>> dual requests are handled in SimicsProcessor.C -- for convenience (both
>>> in terms of the C++ code and for the filtering effect) you may want to
>>> move your trace generation higher into Ruby's hierarchy (say,
>>> SimicsProcessor.C or Sequencer.C).
>>>
>>> **ASI:
>>> Contrary to the name, the ASI is used to specify not the process but the
>>> target address space. 128 is the vanilla address space for main
>>> memory... ASIs are detailed quite extensively in Sun's microprocessor
>>> manuals (google for "sun ultrasparc manual" and see the section on
>>> ASIs). For example, ASI 0x80 (decimal 128) is ASI_PRIMARY, the normal
>>> address space for user-level accesses to main memory, ASI 0x58 is for
>>> accesses to the data TLB, 0x59 is the data TSB, etc. There is not one
>>> ASI per process.
>>>
>>> Regards,
>>> Dan
>>>
>>> Shougata Ghosh wrote:
>>>
>>>> Hi
>>>> I am simulating a 16-processor UltraSPARC-III system with Solaris 10. I
>>>> loaded Ruby (no Opal) with Simics. The protocol I used was
>>>> MESI_SMP_LogTm_directory and I was running the tm-deque microbenchmark
>>>> that comes with GEMS. My goal was to collect the memory traces (only
>>>> data accesses, no instruction accesses) of tm-deque and analyse the
>>>> trace file offline.
>>>> Let me first give a brief overview of how I collect the traces.
>>>>
>>>> I print the clock_cycle (Simics cycle), the cpu making the request, the
>>>> physical address of the memory location, the type of access (r or w),
>>>> and whether this cpu is currently executing a xaction (LogTM). The
>>>> format looks like this:
>>>>
>>>> cycle cpu phys_addr type(r/w) in_xaction
>>>>
>>>> This I print from inside ruby_operate() in ruby.c, since this function
>>>> is called for every memory access Simics makes.
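>>>>
>>>> (Concretely, the line is written more or less like the sketch below. It
>>>> is only a sketch: trace_fp and cpu stand in for my actual file handle
>>>> and the conf_object_t* of the requesting processor, and the mem_op
>>>> field/macro names may need adjusting to the local Simics version:
>>>>
>>>>     fprintf(trace_fp, "%llu %d %llu %c %d\n",
>>>>             (unsigned long long) SIM_cycle_count(cpu),       /* cycle      */
>>>>             SIMICS_current_processor_number(),               /* cpu        */
>>>>             (unsigned long long) mem_op->s.physical_address, /* phys_addr  */
>>>>             SIM_mem_op_is_write(&mem_op->s) ? 'w' : 'r',     /* r/w        */
>>>>             XACT_MGR->inTransaction() ? 1 : 0);              /* in_xaction */
>>>> )
>>>>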
>>>> In addition to this, in a different trace file, I print when a xaction
>>>> begins, commits or aborts. This I print from magic_instruction_callback()
>>>> in commands.C. The format is the following:
>>>>
>>>> cycle cpu xaction_type(B/C/A) xaction_id(for nested xaction)
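>>>>
>>>> (That print is roughly the sketch below -- xact_fp, xaction_type and
>>>> xaction_id are placeholders for my file handle and for whatever the
>>>> magic-instruction value decodes to in commands.C:
>>>>
>>>>     fprintf(xact_fp, "%llu %d %c %d\n",
>>>>             (unsigned long long) SIM_cycle_count(cpu),   /* cycle       */
>>>>             SIMICS_current_processor_number(),           /* cpu         */
>>>>             xaction_type,                                /* 'B'/'C'/'A' */
>>>>             xaction_id);                                 /* nesting id  */
>>>> )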
>>>>
>>>> Once the simulation is completed, I combine the two trace files and sort
>>>> them by the clock cycle field.
>>>>
>>>> *****The biggest issue is having too many requests. I want to filter
>>>> out memory requests from all processes other than tm-deque. Right now,
>>>> I'm filtering out the kernel requests by inspecting the priv field in
>>>> (v9_memory_transaction *) mem_op->priv. If the priv field is 1, I don't
>>>> record that transaction (a sketch of this filter follows the list
>>>> below). I believe this effectively keeps the kernel requests out of my
>>>> trace. But there are other maintenance/service processes, started by
>>>> the kernel, running in user space which access memory, and I want to
>>>> filter those out too. I have tried to detect the pid or some sort of a
>>>> process id from inside Ruby but haven't had any luck so far! Things I
>>>> have looked into are:
>>>>
>>>> - The ASID (address space id) field in (v9_memory_transaction *)
>>>> mem_op->asi. This didn't work! The ASID was fixed at 128 throughout.
>>>> One possible reason is that the ASID only changes between user space
>>>> and kernel space; since I'm only recording user-space accesses, I don't
>>>> see any changes in the ASID.
>>>>
>>>> - The content of global register %g7. From inspecting the OpenSolaris
>>>> code, I noticed that the getpid() function gets the address of the
>>>> current_thread structure from %g7. It then gets a pointer to the
>>>> process that current_thread belongs to from the current_thread
>>>> structure. Next, it reads the process_id from the process structure.
>>>> Since I don't care about the exact pid, I just inspected the value of
>>>> the %g7 register. I didn't see any changes in it! One possibility is of
>>>> course that %g7 stores a virtual address, which could be the same for
>>>> all processes; if all the processes are running just one thread, this
>>>> seemed very likely. So, next I looked into the corresponding physical
>>>> address. Unfortunately, that remained constant as well!
>>>> I'll try reading the content of the memory location pointed to by that
>>>> physical address (thread_phys_addr). Maybe that will have a different
>>>> value! I have yet to look into that.
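>>>>
>>>> (For reference, the filter I described boils down to the sketch below.
>>>> It is only a sketch: record_trace_entry() is a hypothetical helper for
>>>> the fprintf, and the asi check against 0x80 (decimal 128) is
>>>> speculative, since in my runs the ASI never changed anyway:
>>>>
>>>>     /* decide whether to record this transaction */
>>>>     int record = !mem_op->priv          /* priv == 1 => kernel, skip    */
>>>>               && mem_op->asi == 0x80;   /* 0x80 == 128, the normal      */
>>>>                                         /* primary address space        */
>>>>     if (record)
>>>>         record_trace_entry(mem_op);     /* hypothetical fprintf helper  */
>>>> )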
>>>>
>>>> On a side note, how does LogTM differentiate xactional requests from
>>>> non-xactional ones if they both come from the same processor?
>>>>
>>>> *****My second issue is with the clock cycle I print for timestamping.
>>>> I am using SIM_cycle_count() to timestamp the memory accesses. When I
>>>> combine the two traces, I notice that after a xaction has begun,
>>>> subsequent memory accesses printed from ruby_operate() don't have
>>>> in_xaction set to 1! Here's an example:
>>>> 9067854 13 189086172 r 0
>>>> 9067856 13 185775464 w 0
>>>> 9068573 13 B 0 <- xaction begins
>>>> 9069382 13 185775464 w 0
>>>> 9069387 13 185775468 r 0
>>>> .
>>>> .
>>>> .
>>>> 9069558 13 185775468 w 0
>>>> 9069566 13 185775468 w 0
>>>> 9069611 13 185775272 r 1 <- first time in_xaction turns 1
>>>>
>>>> There's always a lag of about 1000 cycles between the xaction Begin and
>>>> in_xaction turning into 1 in the memory access traces. I did make sure I
>>>> set the cpu-switch-cycle to 1 in Simics before I started my simulations!
>>>> I get the value of in_xaction in the following way:
>>>>
>>>>     #define XACT_MGR \
>>>>         g_system_ptr->getChip(SIMICS_current_processor_number() / \
>>>>             RubyConfig::numberOfProcsPerChip())->getTransactionManager( \
>>>>             SIMICS_current_processor_number() % \
>>>>             RubyConfig::numberOfProcsPerChip())
>>>>
>>>>     in_xaction = XACT_MGR->inTransaction();
>>>>
>>>> As I mentioned earlier, I get the clock_cycle from SIM_cycle_count(*cpu).
>>>> Any idea what could be causing this? Do you think I should try using
>>>> ruby_cycles instead?
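>>>>
>>>> (If ruby cycles are the way to go, I assume both trace files would need
>>>> to use the same counter so the merge stays consistent -- e.g. something
>>>> like the line below in both ruby_operate() and
>>>> magic_instruction_callback(), assuming Ruby's global event queue is the
>>>> g_eventQueue_ptr used elsewhere in the GEMS code:
>>>>
>>>>     Time now = g_eventQueue_ptr->getTime();   /* current ruby cycle */
>>>> )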
>>>>
>>>> *****The third issue is specific to the LogTM microbenchmark I was
>>>> running. I was using the LogTM tm-deque microbenchmark. I ran it with 10
>>>> threads and set the # of ops to 10. Initially I wanted small xactions
>>>> without conflicts. When I look at the trace file, I don't see any
>>>> interleaving threads. The 10 threads ran one after the other, in the
>>>> following order:
>>>> thread cpu start_cycle
>>>> T1 13 9068573
>>>> T2 9 10035999
>>>> T3 13 10944933
>>>> T4 2 11654399
>>>> T5 9 11781161
>>>> T6 13 11886113
>>>> T7 4 16280785
>>>> T8 13 16495097
>>>> T9 0 16917327
>>>> T10 6 17562721
>>>>
>>>> Why aren't the threads running in parallel? The code dispatches all 10
>>>> threads in a for-loop and later does a thread_join. I am simulating 16
>>>> processors - I expected all 10 threads to run in parallel! Also, the
>>>> number of clock cycles between the end of one thread and the start of
>>>> the next one is quite large - it varied from 200,000 to 900,000!
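>>>>
>>>> (By "dispatches all 10 threads in a for-loop" I mean the usual pattern,
>>>> roughly as sketched below -- this is a paraphrase, not the actual
>>>> tm-deque code, and worker/NUM_THREADS are placeholders:
>>>>
>>>>     #include <pthread.h>
>>>>
>>>>     void *worker(void *arg);                  /* placeholder thread body */
>>>>
>>>>     pthread_t tid[NUM_THREADS];
>>>>     for (int i = 0; i < NUM_THREADS; i++)
>>>>         pthread_create(&tid[i], NULL, worker, (void *) (long) i);
>>>>     for (int i = 0; i < NUM_THREADS; i++)
>>>>         pthread_join(tid[i], NULL);           /* wait for all workers    */
>>>>
>>>> so I would expect all 10 workers to be runnable at the same time.)
>>>>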
>>>> Am I doing something wrong with the way I am collecting the clock_cycle
>>>> with SIM_cycle_count(current_cpu)?
>>>>
>>>> I would really appreciate it if anyone could share their thoughts/ideas
>>>> on these issues.
>>>> Thanks a lot in advance.
>>>> -shougata
>>>>
>>>> _______________________________________________
>>>> Gems-users mailing list
>>>> Gems-users@xxxxxxxxxxx
>>>> https://lists.cs.wisc.edu/mailman/listinfo/gems-users
>>>> Use Google to search the GEMS Users mailing list by adding
>>>> "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.