Shougata,
Let me take a stab at the areas that I am confident in answering
correctly (e.g., NOT LogTM):
**Too many requests:
Simics uses an "are you sure?" policy when issuing memory requests to
Ruby. That is, each request is passed to Ruby *twice* -- once to
determine the stall time, and once when the stall time has elapsed and
Simics is verifying that Ruby wants the operation to complete. These
dual requests are handled in SimicsProcessor.C -- for convenience (both
in terms of the C++ involved and for the filtering effect) you may want
to move your trace generation higher into Ruby's hierarchy (say, into
SimicsProcessor.C or Sequencer.C).
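If you do keep the tracing at the ruby_operate() level, another
workaround is to remember the requests you have already seen and only
emit a trace record on the second pass (when the operation actually
completes). Here is a rough sketch of that idea; the (cpu, physical
address, read/write) key is just my assumption about what uniquely
identifies a request in your setup:

// Sketch only: track the first pass of each request so the second pass
// (when Ruby lets the operation complete) is the one that gets traced.
#include <map>

struct PendingKey {
  int cpu;
  unsigned long long paddr;
  bool is_write;
  bool operator<(const PendingKey& o) const {
    if (cpu != o.cpu) return cpu < o.cpu;
    if (paddr != o.paddr) return paddr < o.paddr;
    return is_write < o.is_write;
  }
};

static std::map<PendingKey, bool> g_pending;

// Returns true only the second time the same request is seen,
// i.e. when the stall has elapsed and the access is about to complete.
bool second_pass(int cpu, unsigned long long paddr, bool is_write) {
  PendingKey k = { cpu, paddr, is_write };
  std::map<PendingKey, bool>::iterator it = g_pending.find(k);
  if (it == g_pending.end()) {
    g_pending[k] = true;   // first pass: Ruby is computing the stall time
    return false;
  }
  g_pending.erase(it);     // second pass: trace this one
  return true;
}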
**ASI:
Contrary to the name, the ASI specifies not the process but the target
address space. 128 (0x80) is ASI_PRIMARY, the vanilla address space for
ordinary accesses to main memory... ASIs are detailed quite extensively
in Sun's microprocessor manuals (google for "sun ultrasparc manual" and
see the section on ASIs). For example, ASI 0x70 is
ASI_BLOCK_AS_IF_USER_PRIMARY, which is for block accesses to main memory
as if made at user level, ASI 0x58 is for accesses to the data MMU/TLB
registers, 0x59 is the data TSB, etc. There is not one ASI per process.
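As a quick filter for your trace, you could keep only the unrestricted
(user-accessible) data ASIs and drop the internal/diagnostic ones --
something like the check below, where the 0x80 boundary is my reading of
the SPARC V9 restricted/unrestricted split and worth double-checking
against the manual for your target:

// Sketch: keep only "normal" translating data ASIs in the trace.
// In SPARC V9, ASIs 0x00-0x7f are restricted (privileged/internal,
// e.g. 0x58 D-MMU, 0x59 D-TSB) and 0x80-0xff are unrestricted; verify
// the exact boundary for your target in the UltraSPARC manual.
static bool is_plain_data_asi(unsigned int asi) {
  return asi >= 0x80;
}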
Regards,
Dan
Shougata Ghosh wrote:
Hi
I am simulating a 16-processor UltraSPARC-III system running Solaris 10.
I loaded Ruby (no Opal) with Simics. The protocol I used was
MESI_SMP_LogTm_directory, and I was running the tm-deque microbenchmark
that comes with GEMS. My goal was to collect the memory traces of
tm-deque (data accesses only, no instruction accesses) and analyse the
trace file offline.
Let me first give a brief overview of how I collect the traces.
I print the clock_cycle (Simics cycle), the cpu making the request, the
physical address of the memory location, the type of access (r or w),
and whether this cpu is currently executing an xaction (LogTM). The
format looks like this:
cycle cpu phys_addr type(r/w) in_xaction
This I print from inside ruby_operate() in ruby.c, since this function
is called for every memory access simics makes.
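In condensed form, the record I write from ruby_operate() looks roughly
like the helper below; the mem_op field spellings (other than priv and
asi, which I already use) are from memory and may need adjusting to
whatever ruby_operate() actually has in scope:

#include <stdio.h>
// Sketch of the per-access trace record written from ruby_operate().
// 'cpu' and 'mem_op' are what ruby_operate() already has in hand; the
// in_xaction flag comes from the XACT_MGR lookup shown further below.
static void trace_access(FILE* out, conf_object_t* cpu,
                         v9_memory_transaction_t* mem_op, int in_xaction) {
  if (mem_op->priv)              // skip privileged (kernel) accesses
    return;
  fprintf(out, "%lld %d %llu %c %d\n",
          (long long) SIM_cycle_count(cpu),
          SIMICS_current_processor_number(),
          (unsigned long long) mem_op->s.physical_address,
          SIM_mem_op_is_write(&mem_op->s) ? 'w' : 'r',
          in_xaction);
}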
In addition to this, in a different trace file, I print when an xaction
begins, commits, or aborts. This I print from
magic_instruction_callback() in commands.C. The format is as follows:
cycle cpu xaction_type(B/C/A) xaction_id(for nested xaction)
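These records come from a small helper I call out of
magic_instruction_callback(); the helper is my own factoring, not GEMS
code:

#include <stdio.h>
// Sketch: one line per transactional event. 'kind' is 'B', 'C', or 'A'
// for begin/commit/abort; 'xid' is a nesting id I track myself.
static void trace_xaction_event(FILE* out, conf_object_t* cpu,
                                char kind, int xid) {
  fprintf(out, "%lld %d %c %d\n",
          (long long) SIM_cycle_count(cpu),
          SIMICS_current_processor_number(),
          kind, xid);
}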
Once the simulation has completed, I combine the two trace files and
sort them by the clock-cycle field.
*****The biggest issue is with having too many requests. I want to
filter out the memory requests of every process other than tm-deque.
Right now, I'm filtering out the kernel requests by inspecting the priv
field in (v9_memory_transaction *) mem_op->priv. If the priv field is 1,
I don't record that transaction. I believe this effectively keeps the
kernel requests out of my trace. But there are other maintenance/service
processes started by the kernel, running in user space, which access
memory, and I want to filter those out as well. I have tried to detect
the pid or some sort of a process id from inside Ruby but haven't had
any luck so far! The things I have looked into are:
- The ASID (address space id) field in (v9_memory_transaction *)
mem_op->asi. This didn't work: the ASID was fixed at 128 throughout. One
possible reason is that the ASID perhaps only changes between user space
and kernel space; since I'm only recording user-space accesses, I
wouldn't see any change in the ASID.
- The content of global register %g7. From inspecting the OpenSolaris
code, I noticed that the getpid() function gets the address of the
current_thread structure from %g7. It then gets a pointer to the process
the current_thread belongs to from the current_thread structure, and
reads the process id from the process structure. Since I don't care
about the exact pid, I just inspected the value of the %g7 register -- and
didn't see any changes in it! One possibility, of course, is that %g7
stores a virtual address which could be the same for all processes; if
all the processes are running just one thread each, that seemed very
likely. So next I looked at the corresponding physical address.
Unfortunately, that remained constant as well!
I'll try reading the content of the memory location pointed to by that
physical address (thread_phys_addr) -- maybe that will have a different
value! I have yet to look into it; a rough sketch of what I intend to
try is below.
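The Simics calls here are from memory, so the exact names and signatures
need to be checked against the API reference for the Simics version I'm
on:

// Sketch: follow %g7 -> physical address -> memory contents, to see
// whether the pointed-to thread/proc structure differs across processes.
// Verify SIM_get_register_number/SIM_read_register/SIM_logical_to_physical/
// SIM_read_phys_memory against the Simics API docs before relying on this.
static unsigned long long read_curthread_word(conf_object_t* cpu) {
  int g7 = SIM_get_register_number(cpu, "g7");
  unsigned long long vaddr = SIM_read_register(cpu, g7);
  // translate the address the same way a data access would
  physical_address_t paddr = SIM_logical_to_physical(cpu, Sim_DI_Data, vaddr);
  // read 8 bytes of the structure at that physical address
  return SIM_read_phys_memory(cpu, paddr, 8);
}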
On a side note, how does LogTM differentiate xactional requests from
non-xactional ones if they both come from the same processor?
*****My second issue is with the clock cycle I print for timestamping. I
am using SIM_cycle_count() to timestamp the memory accesses. When I
combine the two traces, I notice that after an xaction has begun,
subsequent memory accesses printed from ruby_operate() don't have
in_xaction set to 1! Here's an example:
9067854 13 189086172 r 0
9067856 13 185775464 w 0
9068573 13 B 0 <- xaction begins
9069382 13 185775464 w 0
9069387 13 185775468 r 0
.
.
.
9069558 13 185775468 w 0
9069566 13 185775468 w 0
9069611 13 185775272 r 1 <- first time in_xaction turns 1
There's always a lag of about 1000 cycles between the xaction Begin
record and in_xaction turning to 1 in the memory-access trace. I did
make sure I set cpu-switch-time to 1 in Simics before I started my
simulations!
I get the value of in_xaction in the following way:
#define XACT_MGR \
g_system_ptr->getChip(SIMICS_current_processor_number()/RubyConfig::numberOfProcsPerChip())->getTransactionManager(SIMICS_current_processor_number()%RubyConfig::numberOfProcsPerChip())
in_xaction = XACT_MGR->inTransaction();
As I mentioned earlier, I get the clock_cycle from SIM_cycle_count(*cpu).
Any idea what could be causing this? Do you think I should try using
ruby_cycles instead?
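If ruby_cycles is the better clock, I assume I'd grab it with something
like the helper below (g_eventQueue_ptr->getTime() is my guess at the
right accessor inside Ruby; please correct me if there is a preferred
way):

// Sketch: timestamp trace records with Ruby's event-queue time instead
// of SIM_cycle_count(), so the records and the transaction manager see
// the same notion of "now". The accessor name is an assumption.
static long long ruby_timestamp() {
  return (long long) g_eventQueue_ptr->getTime();
}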
*****The third issue is specific to the LogTM microbenchmark I was
running. I was using the LogTM tm-deque microbenchmark, ran it with 10
threads, and set the # of ops to 10. Initially I wanted small xactions
without conflicts. When I look at the trace file, I don't see any
interleaving threads. The 10 threads ran one after the other, in the
following order:
thread cpu start_cycle
T1 13 9068573
T2 9 10035999
T3 13 10944933
T4 2 11654399
T5 9 11781161
T6 13 11886113
T7 4 16280785
T8 13 16495097
T9 0 16917327
T10 6 17562721
Why aren't the threads running in parallel? The code dispatches all 10
threads in a for-loop and later does a thread_join. I am simulating 16
processors - I expected all 10 threads to run in parallel! Also, the
number of clock cycles between the end of one thread and the start of
the next one is quite large - it varied from 200,000 to 900,000!
Am I doing something wrong with the way I am collecting the clock_cycle
with SIM_cycle_count(current_cpu)?
I would really appreciate it if anyone could share their thoughts/ideas
on these issues.
Thanks a lot in advance.
-shougata