That is quite the complex race indeed (nice picture). While I can't
comment on MESI_CMP_filter_directory itself or on whether this race is
accounted for, I can speculate as to why it wouldn't arise without your
modification. Baseline Simics+Ruby models an SC in-order uniprocessor
that blocks on any coherence request until completion.
That is, Ruby's isReady() might end up stalling L1cache-1's Store
request until the writeback is complete. However, with 'inserted'
invalidations (I'm not 100% sure how you're inserting the invalidations)
there is a chance that isReady() wouldn't stall the request. At any
rate, I suppose I'm suggesting a protocol-independent fix for you --
have Ruby's isReady() function return false if a writeback is
outstanding for the cache block.
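For what it's worth, the guard I have in mind could look something like
this (a minimal C++ sketch, not actual GEMS code -- the Sequencer class,
writebackStarted/writebackFinished hooks, and the 64-byte line size are
all illustrative assumptions, not Ruby's real API):

```cpp
#include <cstdint>
#include <set>

using Address = uint64_t;

// Illustrative stand-in for Ruby's sequencer: it tracks blocks with an
// in-flight writeback and refuses to accept new requests for them.
class Sequencer {
public:
    void writebackStarted(Address addr)  { m_outstanding_wb.insert(lineAddress(addr)); }
    void writebackFinished(Address addr) { m_outstanding_wb.erase(lineAddress(addr)); }

    // The protocol-independent fix: report "not ready" (i.e. stall the
    // request) while a writeback is outstanding for the same cache block.
    bool isReady(Address addr) const {
        return m_outstanding_wb.count(lineAddress(addr)) == 0;
    }

private:
    // Assumes 64-byte cache lines (illustrative).
    static Address lineAddress(Address addr) { return addr & ~Address(63); }
    std::set<Address> m_outstanding_wb;
};
```

With something like this in place, L1cache-1's Store to the block being
written back would simply sit in the mandatory queue until the writeback
handshake completes, regardless of the protocol.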
Regards,
Dan
Rubén Titos wrote:
Dear list,
I've been working for some time now with MESI_CMP_filter_directory in
GEMS 2.0, and so far I hadn't run into any protocol bug. However,
after adding some cache-flushing functionality (which barely
modified the L1 protocol file) I've come across the following race
condition, while debugging my own version of the tester with
randomization=true. I believe this bug is independent of my code, as I
haven't changed any protocol state, event, or transition; I've only
added an extra event and made some minor additions to the code in
in_port(mandatoryQueue_in).
For the sake of clarity, I'll avoid pasting the debug trace and
instead describe the race condition (btw, I've also drawn it; you can
have a look here:
http://skywalker.inf.um.es/~rtitos/files/protocol_race.png )
- It is initiated by the replacement of a line (in state M) at
L1cache-1 (in my case when flushing the cache, though it may just as
well happen in the original version of the protocol with any ordinary
replacement). The PUTX msg sent by L1cache-1 is delayed many cycles
on its trip to the L2, and in the meantime the following happens:
--- L1cache-0 issues a GETS, which arrives at the L2 and is forwarded to
L1cache-1, finding the line in state I_M (since it's being replaced).
--- L1cache-1 sends the data (dataSfromL1) to the requestor (L1cache-0)
and a writeback to the L2, and sets the line to state I.
--- L1cache-0 receives the data and sends an unblock to the L2.
--- L1cache-1 receives a Store request, and since the block is in
state I it issues a new GETX message, setting the block to state IM.
--- The unblock from core 0 arrives at the L2 and updates the directory
(but it doesn't remove core 1 from the sharer list).
- Finally, the L1_PUTX arrives at the L2. Since L1cache-1 is still
listed as a sharer, it is not treated as L1_PUTX_old but is instead
recycled (the line is blocked at the L2 (MT_IB), waiting for WB_Data
from L1cache-1).
- The GETX from L1cache-1 arrives at L2 and it's also recycled.
- Both GETX and PUTX from L1cache-1 are recycled a few times.
- The WB_Data from core 1 arrives, and the line is unblocked.
- At this moment, the GETX is the next msg in the recycled queue
(because the PUTX was considered right before the WB_Data arrived), so
it's processed and the line is blocked again, going from SS to SS_MB.
The L2 sends an INV to L1cache-0 and L2Data to L1cache-1.
- The PUTX is recycled over and over again since the line is blocked.
- L1cache-1 receives the L2Data and transitions from IM to IM_M,
waiting for the ack from L1cache-0.
- L1cache-0 receives Inv and sends Ack to L1cache-1.
- L1cache-1 receives the Ack, changes from IM_M to M (resolving the
store miss), and sends an Exclusive_Unblock to the L2.
- When the L2 receives the Exclusive_Unblock, it changes the state from
SS_MB to MT.
- At this moment the line is no longer blocked, so the PUTX is
processed and a WB_Ack is sent to L1cache-1.
- CRASH: upon reception, L1cache-1's line is in state M, and it
shouldn't receive any WB_Ack msg in this state (no valid transition).
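To make the failure mode concrete, here is a toy C++ replay of
L1cache-1's side of the trace above. This is just an illustration, not
SLICC: the state and event names merely mirror the ones used in this
message, and the transition table contains only the steps the trace
exercises.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using State = std::string;
using Event = std::string;
using Key   = std::pair<State, Event>;

// Only the L1cache-1 transitions that the trace actually takes.
static const std::map<Key, State> l1Transitions = {
    {{"M",    "Replacement"}, "I_M"},   // PUTX sent; flush/replacement begins
    {{"I_M",  "Fwd_GETS"},    "I"},     // dataSfromL1 to L1cache-0, WB to L2
    {{"I",    "Store"},       "IM"},    // new GETX issued
    {{"IM",   "L2Data"},      "IM_M"},  // wait for ack from L1cache-0
    {{"IM_M", "Ack"},         "M"},     // store resolved, Exclusive_Unblock
    // Note: no entry for {"M", "WB_Ack"} -- that is the crash.
};

// Replays a sequence of events from state M; returns the final state,
// or a "CRASH: ..." string if an event has no valid transition.
State replay(const std::vector<Event>& events) {
    State s = "M";
    for (const Event& e : events) {
        auto it = l1Transitions.find({s, e});
        if (it == l1Transitions.end())
            return "CRASH: no transition for (" + s + ", " + e + ")";
        s = it->second;
    }
    return s;
}
```

Replaying {Replacement, Fwd_GETS, Store, L2Data, Ack} lands the line
back in M, and the late WB_Ack then has no matching transition --
exactly the crash described above.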
Can anyone confirm this protocol bug? Or am I missing some subtle but
important step? Again, my code didn't touch any of the "basic protocol
stuff", but what confuses me is that the randomized tester can't break
the original protocol, yet it breaks my own version.
Cheers,
Rubén
--
Ruben Titos
Parallel Computing and Architecture Group (GACOP)
Computer Engineering Dept.
University of Murcia
------------------------------------------------------------------------
_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.
--
http://www.cs.wisc.edu/~gibson [esc]:wq!