Re: [Gems-users] Issues with collecting memory access trace for logtm microbenchmark (tm-deque)


Date: Thu, 11 Jan 2007 08:11:39 -0600
From: Dan Gibson <degibson@xxxxxxxx>
Subject: Re: [Gems-users] Issues with collecting memory access trace for logtm microbenchmark (tm-deque)
Shougata,
Let me take a stab at the areas I am confident in answering correctly (e.g., NOT LogTM):

**Too many requests:
Simics uses an "are you sure?" policy when issuing memory requests to Ruby. That is, each request is passed to Ruby *twice* -- once to determine the stall time, and once after the stall time has elapsed, when Simics verifies that Ruby wants the operation to complete. These dual requests are handled in SimicsProcessor.C -- for convenience (both in terms of C++ and for the filtering effect) you may want to move your trace generation higher into Ruby's hierarchy (say, SimicsProcessor.C or Sequencer.C).
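If you do keep the tracing in ruby_operate(), one possible workaround is to drop the second pass by remembering the previous request per cpu. The sketch below (helper name is mine) assumes the two passes for a given access arrive back-to-back with identical address and type -- you would want to verify that before trusting it:

    /* Sketch: suppress the duplicate "are you sure?" pass, assuming the two
       passes for one access arrive back-to-back with the same address/type. */
    #include <stdint.h>

    #define MAX_CPUS 64

    struct LastReq { uint64_t addr; int is_write; int valid; };
    static struct LastReq last_req[MAX_CPUS];

    /* Returns 1 if this request looks like the second pass of the previous one. */
    int is_duplicate_request(int cpu, uint64_t phys_addr, int is_write)
    {
      if (last_req[cpu].valid &&
          last_req[cpu].addr == phys_addr &&
          last_req[cpu].is_write == is_write) {
        last_req[cpu].valid = 0;   /* consume the pair */
        return 1;                  /* skip: already traced on the first pass */
      }
      last_req[cpu].addr = phys_addr;
      last_req[cpu].is_write = is_write;
      last_req[cpu].valid = 1;
      return 0;
    }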

**ASI:
Contrary to the name, the ASI specifies not the process but the target address space. 128 is the vanilla address space for main memory... ASIs are detailed quite extensively in Sun's microprocessor manuals (google for "sun ultrasparc manual" and see the section on ASIs). For example, ASI 0x80 (decimal 128) is ASI_PRIMARY, the normal address space for user-level accesses to main memory; ASI 0x70 is ASI_BLOCK_AS_IF_USER_PRIMARY; ASI 0x58 is for accesses to the data MMU/TLB registers; 0x59 is the data TSB; etc. There is not one ASI per process.
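If you want to filter on those fields, something like the sketch below could sit next to your existing priv check. The field names follow the v9_memory_transaction usage in your mail; treat the 0x80 cutoff as an assumption to verify against your Simics headers (you may also want to allow ASI_SECONDARY, 0x81):

    /* Sketch: decide whether a transaction should be traced, given the priv
       and asi fields read from the v9_memory_transaction. */
    static int should_trace(int priv, unsigned int asi)
    {
      if (priv)
        return 0;          /* kernel-mode access */
      if (asi != 0x80)
        return 0;          /* not ASI_PRIMARY, the normal address space */
      return 1;
    }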

Regards,
Dan

Shougata Ghosh wrote:

Hi
I am simulating a 16-processor UltraSPARC-III machine running Solaris 10. I loaded Ruby (no Opal) with Simics. The protocol I used was MESI_SMP_LogTm_directory, and I was running the tm-deque microbenchmark that comes with GEMS. My goal is to collect the memory traces (data accesses only, no instruction accesses) of tm-deque and analyse the trace file offline.
Let me first give a brief overview of how I collect the traces.

I print the clock cycle (Simics cycle), the cpu making the request, the physical address of the memory location, the type of access (r or w), and whether that cpu is currently executing a xaction (LogTM). The format looks like this:

cycle    cpu    phys_addr    type(r/w)    in_xaction

I print this from inside ruby_operate() in ruby.c, since that function is called for every memory access Simics makes. In addition, in a separate trace file, I print when a xaction begins, commits or aborts; this I print from magic_instruction_callback() in commands.C. Its format is the following:

cycle    cpu    xaction_type(B/C/A)    xaction_id(for nested xaction)
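Schematically, the two kinds of records are emitted like this (illustrative only -- the file handles and variable names are placeholders, not my exact code):

    #include <cstdio>

    /* one line per data access, printed from ruby_operate() */
    void print_mem_record(FILE* f, long long cycle, int cpu,
                          unsigned long long phys_addr, char rw, int in_xaction)
    {
      fprintf(f, "%lld\t%d\t%llu\t%c\t%d\n",
              cycle, cpu, phys_addr, rw, in_xaction);
    }

    /* one line per begin/commit/abort, printed from magic_instruction_callback() */
    void print_xact_record(FILE* f, long long cycle, int cpu,
                           char event /* 'B', 'C' or 'A' */, int xaction_id)
    {
      fprintf(f, "%lld\t%d\t%c\t%d\n", cycle, cpu, event, xaction_id);
    }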

Once the simulation is complete, I combine the two trace files and sort the result by the clock cycle field.
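(The offline merge is nothing fancy -- roughly the following, which assumes the cycle count is the first whitespace-separated field on every line of both files:)

    // Sketch: read any number of trace files, sort all lines by the leading
    // cycle field, and print the combined trace to stdout.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<unsigned long long, std::string> Record;

    static bool by_cycle(const Record& a, const Record& b)
    {
      return a.first < b.first;
    }

    int main(int argc, char** argv)
    {
      std::vector<Record> records;
      for (int i = 1; i < argc; ++i) {
        std::ifstream in(argv[i]);
        std::string line;
        while (std::getline(in, line)) {
          std::istringstream iss(line);
          unsigned long long cycle = 0;
          if (iss >> cycle)
            records.push_back(std::make_pair(cycle, line));
        }
      }
      std::stable_sort(records.begin(), records.end(), by_cycle);
      for (size_t i = 0; i < records.size(); ++i)
        std::cout << records[i].second << "\n";
      return 0;
    }

(e.g. ./merge_trace mem_trace.txt xact_trace.txt > combined.txt)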

*****The biggest issue is that there are far too many requests. I want to exclude from the trace every process other than tm-deque. Right now, I exclude kernel requests by inspecting the priv field in ((v9_memory_transaction *) mem_op)->priv: if priv is 1, I don't record that transaction. I believe this effectively keeps kernel requests out of my trace. But there are other maintenance/service processes started by the kernel that run in user space and access memory, and I want to exclude those too. I have tried to detect the pid, or some sort of process id, from inside Ruby, but haven't had any success so far. Things I have looked into are:

- The ASI (address space identifier) field in ((v9_memory_transaction *) mem_op)->asi. This didn't work: the ASI was a fixed 128 throughout. One possible reason is that the ASI only changes between user space and kernel space; since I'm recording only user-space accesses, I never see it change.

- The contents of global register %g7. From inspecting the OpenSolaris code, I noticed that getpid() obtains the address of the current thread structure from %g7, follows a pointer from that structure to the process it belongs to, and then reads the process id from the process structure. Since I don't care about the exact pid, I simply inspected the value of %g7 -- and I didn't see it change. One possibility, of course, is that %g7 holds a virtual address that is the same for every process; if each process runs only one thread, that seemed very likely. So next I looked at the corresponding physical address. Unfortunately, that stayed constant as well. I still plan to read the contents of the memory location pointed to by that physical address (thread_phys_addr); maybe that will have a different value. I have yet to look into that.
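(When I get to it, I expect the register read and the physical-memory read to look roughly like the sketch below -- the calls are from the standard Simics API, but I haven't verified the exact signatures against the version I'm running, and error handling is omitted:)

    /* Sketch: read %g7 for the current cpu, translate it, and fetch the
       first 8 bytes of the structure it points to. */
    conf_object_t *cpu = SIM_current_processor();
    int g7_num = SIM_get_register_number(cpu, "g7");
    uinteger_t g7_virt = SIM_read_register(cpu, g7_num);

    physical_address_t thread_phys_addr =
        SIM_logical_to_physical(cpu, Sim_DI_Data, g7_virt);

    uinteger_t thread_word = SIM_read_phys_memory(cpu, thread_phys_addr, 8);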

As an aside, how does LogTM differentiate transactional requests from non-transactional ones if they both come from the same processor?

*****My second issue is with the clock cycle I print for timestamping. I am using SIM_cycle_count() to timestamp the memory accesses. When I combine the two traces, I notice that after a xaction has begun, the subsequent memory accesses printed from ruby_operate() don't have in_xaction set to 1. Here's an example:
9067854    13    189086172    r    0
9067856    13    185775464    w    0
9068573    13    B    0            <- xaction begins
9069382    13    185775464    w    0
9069387    13    185775468    r    0
.
.
.
9069558    13    185775468    w    0
9069566    13    185775468    w    0
9069611    13    185775272    r    1    <- first time in_xaction turns 1

There's always a lag of about 1000 cycles between the xaction Begin and in_xaction turning to 1 in the memory access trace. I did make sure I set the cpu-switch-cycle to 1 in Simics before I started my simulations! I get the value of in_xaction in the following way:

    #define XACT_MGR \
        g_system_ptr->getChip(SIMICS_current_processor_number() / \
                              RubyConfig::numberOfProcsPerChip())-> \
            getTransactionManager(SIMICS_current_processor_number() % \
                                  RubyConfig::numberOfProcsPerChip())

    in_xaction = XACT_MGR->inTransaction();

As I mentioned earlier, I get the clock cycle from SIM_cycle_count(*cpu). Any idea what could be causing this? Do you think I should try using ruby_cycles instead?
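(If Ruby cycles are the better time base, I assume the change is just to timestamp both trace files from Ruby's event queue instead of from Simics -- something like the fragment below, assuming the global g_eventQueue_ptr is visible from both ruby_operate() and magic_instruction_callback():)

    /* Fragment: Ruby's own cycle count, so both record streams share one
       time base.  Time is Ruby's cycle typedef. */
    Time ruby_cycle = g_eventQueue_ptr->getTime();
    /* ... then print ruby_cycle instead of SIM_cycle_count(*cpu) ... */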

*****My third issue is specific to the LogTM microbenchmark I was running, tm-deque. I ran it with 10 threads and set the number of ops to 10; initially I wanted small xactions without conflicts. When I look at the trace file, I don't see the threads interleaving at all. The 10 threads ran one after the other, in the following order:
thread        cpu    start_cycle
T1        13    9068573
T2        9    10035999
T3        13    10944933
T4        2    11654399
T5        9    11781161
T6        13    11886113
T7        4    16280785
T8        13    16495097
T9        0    16917327
T10        6    17562721

Why aren't the threads running in parallel? The code dispatches all 10 threads in a for-loop and later does a thread join. I am simulating 16 processors, so I expected all 10 threads to run in parallel! Also, the number of clock cycles between the end of one thread and the start of the next one is quite large -- it varied from 200,000 to 900,000! Am I doing something wrong with the way I am collecting the clock cycle with SIM_cycle_count(current_cpu)?
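(By "dispatches in a for-loop" I mean the usual pattern below -- an illustration of the structure I'm describing, not the actual tm-deque source:)

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 10

    static void* worker(void* arg)
    {
      long id = (long) arg;
      /* ... per-thread transactional operations on the deque ... */
      printf("thread %ld done\n", id);
      return NULL;
    }

    int main(void)
    {
      pthread_t tid[NUM_THREADS];
      for (long i = 0; i < NUM_THREADS; i++)     /* dispatch all workers */
        pthread_create(&tid[i], NULL, worker, (void*) i);
      for (int i = 0; i < NUM_THREADS; i++)      /* then join them all */
        pthread_join(tid[i], NULL);
      return 0;
    }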

I would really appreciate it if anyone could share their thoughts/ideas on these issues.
Thanks a lot in advance.
-shougata

_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.

