Re: [Gems-users] MESI_SCMP_ protocol crush


Date: Tue, 1 Apr 2008 03:19:43 -0400
From: "Konstantinos Aisopos" <kaisopos@xxxxxxxxx>
Subject: Re: [Gems-users] MESI_SCMP_ protocol crush
I found some more properties of my problem: it is independent
of the protocol (I get the same assertion failure when using
MSI_MOSI_CMP_directory), but it depends on the number of nodes.
When simulating 16 nodes (in the tester or in Ruby) everything works fine,
but when I try 64, the assertion fails.

The assertion concerns the mapping of memory addresses to an L2
tile. It is inside the function addSharer and checks, whenever a
sharer is added, that the address really is mapped to this
tile:

  void addSharer(Address addr, MachineID requestor) {
    DEBUG_EXPR(machineID);
    DEBUG_EXPR(requestor);
    DEBUG_EXPR(addr);
    assert(map_L1CacheMachId_to_L2Cache(addr, requestor) == machineID);  // <-- FAILED
    L2cacheMemory[addr].Sharers.add(requestor);
  }
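
To see which side of the comparison is wrong, it can help to report both values instead of relying on a bare assert. The following is only a minimal sketch in plain C++, not GEMS or SLICC code: `BankId` and `check_home_bank` are hypothetical names, and `BankId` is a simplified stand-in for Ruby's MachineID that keeps only the bank number.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical, simplified stand-in for Ruby's MachineID (machine type omitted).
struct BankId { int num; };

// Checks the invariant that addSharer asserts: the L2 bank handling the
// request (self) must be the bank the address mapping selects (mapped).
// Returns an empty string when the invariant holds, otherwise a diagnostic
// naming the address, the requesting L1, and both bank numbers.
std::string check_home_bank(unsigned long long addr, int requestor,
                            BankId self, BankId mapped) {
  assert(requestor >= 0);  // L1 numbers are non-negative
  if (self.num == mapped.num) return "";
  char buf[160];
  std::snprintf(buf, sizeof buf,
                "addr=0x%llx requestor=L1-%d arrived at L2-%d but maps to L2-%d",
                addr, requestor, self.num, mapped.num);
  return buf;
}
```

A message of this kind shows whether the request was delivered to an unexpected bank or whether the mapping itself returned an unexpected bank number.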

I found the code that does the mapping but didn't really understand
what it does (I have pasted it at the end of this email).

Huan,

My topology is an 8x8 mesh, which I initially created with
GarnetFileMaker.py, and then added the memory nodes manually.

Mike,

Here's a trace of MSI_MOSI_CMP_directory, simulating 64 cores
(parameters: -p 64 -e 64 -a 64 -m 64 -n FILE_SPECIFIED -l 1 -s 1):

Request trace enabled to output file 'ruby.trace.gz'
      2  46  -1        Seq               Begin       >       [0x62c0, line 0x62c0] ST
      4  18  -1        Seq               Begin       >       [0x39c0, line 0x39c0] ST
      6  31  -1        Seq               Begin       >       [0x2cc0, line 0x2cc0] ST
      6   0  46    L1Cache               Store     NP>L1_IM  [0x62c0, line 0x62c0]
      7   0  31    L1Cache               Store     NP>L1_IM  [0x2cc0, line 0x2cc0]
      8  13  -1        Seq               Begin       >       [0x59c0, line 0x59c0] ST
      8   0  18    L1Cache               Store     NP>L1_IM  [0x39c0, line 0x39c0]
     10   0  13    L1Cache               Store     NP>L1_IM  [0x59c0, line 0x59c0]
     10  51  -1        Seq               Begin       >       [0x3fc0, line 0x3fc0] ATOMIC
     12  40  -1        Seq               Begin       >       [0x9c0, line 0x9c0] ST
     13   0  51    L1Cache               Store     NP>L1_IM  [0x3fc0, line 0x3fc0]
     14  49  -1        Seq               Begin       >       [0x60c0, line 0x60c0] ST
     16   0  49    L1Cache               Store     NP>L1_IM  [0x60c0, line 0x60c0]
     16   0  40    L1Cache               Store     NP>L1_IM  [0x9c0, line 0x9c0]
     16  18  -1        Seq               Begin       >       [0x4c0, line 0x4c0] ST
     18   2  -1        Seq               Begin       >       [0x56c0, line 0x56c0] ST
     19   0  18    L1Cache               Store     NP>L1_IM  [0x4c0, line 0x4c0]
     20  14  -1        Seq               Begin       >       [0x5dc0, line 0x5dc0] ST
     21   0   2    L1Cache               Store     NP>L1_IM  [0x56c0, line 0x56c0]
     22  35  -1        Seq               Begin       >       [0x52c0, line 0x52c0] ATOMIC
     23   0  14    L1Cache               Store     NP>L1_IM  [0x5dc0, line 0x5dc0]
     24   4  -1        Seq               Begin       >       [0x44c0, line 0x44c0] ST
Runtime Error at ../protocols/MSI_MOSI_CMP_directory-L2cache.sm:275, Ruby Time: 24: assert failure, PID: 30232
press return to continue.

I'm looking into how to print the machineID and requestor in the
trace, but the assertion seems to fail the *first time* the code
reaches it (the first time a sharer is added), so it looks like a
mapping problem, not a coherence protocol problem.

thoughts?

thanks a bunch for the help,
-Kostas

------- mapping code: ruby/slicc_interface/RubySlicc_ComponentMapping.h------

// input parameter is the base ruby node of the L1 cache
// returns a value between 0 and total_L2_Caches_within_the_system
inline
MachineID map_L1CacheMachId_to_L2Cache(const Address& addr, MachineID L1CacheMachId)
{
  int L2bank = 0;
  MachineID mach = {MACHINETYPE_L2CACHE_ENUM, 0};

  if (RubyConfig::L2CachePerChipBits() > 0) {
    if (MAP_L2BANKS_TO_LOWEST_BITS) {
      L2bank = addr.bitSelect(RubyConfig::dataBlockBits(),
                              RubyConfig::dataBlockBits()+RubyConfig::L2CachePerChipBits()-1);
    } else {
      L2bank = addr.bitSelect(RubyConfig::dataBlockBits()+L2_CACHE_NUM_SETS_BITS,
                              RubyConfig::dataBlockBits()+L2_CACHE_NUM_SETS_BITS+RubyConfig::L2CachePerChipBits()-1);
    }
  }

  assert(L2bank < RubyConfig::numberOfL2CachePerChip());
  assert(L2bank >= 0);

  mach.num = RubyConfig::L1CacheNumToL2Base(L1CacheMachId.num)*RubyConfig::numberOfL2CachePerChip() // base #
    + L2bank;  // bank #
  assert(mach.num < RubyConfig::numberOfL2Cache());
  return mach;
}
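
The bank-selection arithmetic in the function above can be tried outside the simulator. This is only a sketch under stated assumptions, not the GEMS implementation: `bit_select` is a hypothetical stand-in for Address::bitSelect, assumed to return the inclusive bit range [small, big] shifted down to bit 0 (which is how the call sites above appear to use it), and the bit widths are passed in as plain parameters instead of being read from RubyConfig.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Address::bitSelect: returns bits [small, big]
// (inclusive) of addr, shifted down so that bit `small` becomes bit 0.
static uint64_t bit_select(uint64_t addr, int small, int big) {
  assert(small >= 0 && big >= small && big < 64);
  const int width = big - small + 1;
  const uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
  return (addr >> small) & mask;
}

// Mirrors the MAP_L2BANKS_TO_LOWEST_BITS branch: the bank index is taken
// from the bits immediately above the block offset.
static int l2_bank_lowest_bits(uint64_t addr, int block_bits, int bank_bits) {
  if (bank_bits == 0) return 0;  // single bank: no bits to select
  return static_cast<int>(
      bit_select(addr, block_bits, block_bits + bank_bits - 1));
}

// Mirrors the else branch: the bank index is taken from the bits above
// the block offset and the set index.
static int l2_bank_above_sets(uint64_t addr, int block_bits, int set_bits,
                              int bank_bits) {
  if (bank_bits == 0) return 0;
  const int lo = block_bits + set_bits;
  return static_cast<int>(bit_select(addr, lo, lo + bank_bits - 1));
}
```

Assuming 64-byte blocks (6 block-offset bits, consistent with the block-aligned addresses in the trace) and 64 banks per chip (6 bank bits), `l2_bank_lowest_bits(0x62c0, 6, 6)` selects bank 11. With a hypothetical 10 set-index bits, `l2_bank_above_sets(0x62c0, 6, 10, 6)` selects bank 0, because the address has no bits set at position 16 or above. Which branch applies, and the real value of L2_CACHE_NUM_SETS_BITS, depend on the build configuration.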


On Sun, Mar 30, 2008 at 10:22 PM, Mike Marty <mike.marty@xxxxxxxxx> wrote:
> I have no idea why that assertion would be triggered.  I would print
> out the machineID and requestor. See the wiki for generating a
> protocol debug trace.  Grep on the block address that causes the
> assertion.  Add extra debugging information to the trace using
> APPEND_TRANSITION_COMMENT and DEBUG_EXPR.
>
> --Mike
>
>
> On Sun, Mar 30, 2008 at 3:27 PM, Konstantinos Aisopos
> <kaisopos@xxxxxxxxx> wrote:
> > Hello again,
> >
> >  any ideas about my problem? any idea what this assertion prevents from
> >  happening? Should I provide you more information? Does the MESI_SCMP
> >  require any other parameters to be set that I don't know??
> >
> >  I thought it was a topology problem so I created the file:
> >  ruby/network/simple/Network_Files/NUCA_Procs-64_ProcsPerChip-64_L2Banks-64_Memories-64.txt
> >  and set these parameters:
> >  ruby0.setparam_str g_CACHE_DESIGN NUCA
> >  ruby0.setparam_str g_NETWORK_TOPOLOGY FILE_SPECIFIED
> >  ... the problem still persists. I got rid of Opal to make the
> >  simulation simpler; the problem persists. Also, if I don't load Ruby the
> >  simulation works fine.
> >
> >  help please :P
> >
> >  -Kostas
> >
> >
> >
> >
> >  On Thu, Mar 27, 2008 at 10:54 PM, Konstantinos Aisopos
> >  <kaisopos@xxxxxxxxx> wrote:
> >  > Hi list,
> >  >
> >  > I am using the MESI_SCMP_bankdirectory protocol to simulate a 64-core
> >  > system. I haven't touched the protocol or the simulator. I am
> >  > executing the following script:
> >  >
> >  > instruction-fetch-mode instruction-fetch-trace
> >  > istc-disable
> >  > dstc-disable
> >  > cpu-switch-time 1
> >  > load-module ruby
> >  > load-module opal
> >  > ruby0.setparam g_NUM_PROCESSORS 64
> >  > ruby0.setparam g_PROCS_PER_CHIP 64
> >  > ruby0.setparam g_NUM_L2_BANKS 64
> >  > ruby0.setparam g_NUM_MEMORIES 64
> >  > ruby0.setparam NUMBER_OF_VIRTUAL_NETWORKS 5
> >  > ruby0.setparam g_MEMORY_SIZE_BYTES 4294967296
> >  > ruby0.setparam g_endpoint_bandwidth 1000
> >  > ruby0.init
> >  > opal0.init
> >  > opal0.sim-start "results.opal"
> >  > opal0.sim-step 10000000000
> >  >
> >  > and I am getting the following error when I execute "opal0.sim-step
> >  > 10000000000":
> >  >
> >  > Runtime Error at ../protocols/MESI_SCMP_bankdirectory-L2cache.sm:224,
> >  > Ruby Time: 23: assert failure, PID: 1335
> >  >
> >  > line 224 is:
> >  > assert(map_L1CacheMachId_to_L2Cache(addr,requestor) == machineID)
> >  >
> >  > any idea what might be wrong?
> >  >
> >  > thanks,
> >  >
> >  > Kostas
> >  >
>
> >  _______________________________________________
> >  Gems-users mailing list
> >  Gems-users@xxxxxxxxxxx
> >  https://lists.cs.wisc.edu/mailman/listinfo/gems-users
> >  Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.
> >
> >