Date: Fri, 18 Jan 2008 11:42:45 +0100
From: "Rubén Titos" <rtitos@xxxxxxxxxxx>
Subject: Re: [Gems-users] MESI_CMP_filter_directory - race with PUTX
Hi Luke,

Thanks for your reply. The silent drop of a stale PUTX is actually the first thing I looked into when I was debugging the protocol: I looked for the "conversion" of the PUTX into a PUTX_old when the node is removed as a sharer, but I noticed that in certain cases this didn't happen. I'll try to show when (numbers in brackets refer to the message arrival times in my picture):

- (2) Exclusive = Sharers = {L1cache-1}
- (6) Exclusive = Sharers = {L1cache-1} -----> Sharers = {L1cache-0, L1cache-1}
- (7) Sharers = {L1cache-0, L1cache-1} ------> The L1_PUTX is not discarded, since L1cache-1 is a sharer.
- (between 7 and 8) The L1_GETX from L1cache-1 arrives, but the line is locked in MT_IB, waiting for the WB_Data from L1cache-1 to complete the $-to-$ transfer.
- (8) Sharers = {L1cache-0, L1cache-1} ------> When the WB_Data arrives, the data is written to the L2 cache and L1cache-1 stays a sharer, because the WB message doesn't inform the L2 that, when the Fwd_GETS arrived, the data was sent to both L1cache-0 and the L2 from the TBE, since the block was being replaced.
- (10) Sharers = {L1cache-0, L1cache-1} ------> At this point there is a chance that the L1_PUTX is served and a WB_Ack is sent to L1cache-1 while the block is still in MP at the L1, thus avoiding the buggy race. But the recycling of both L1_PUTX and L1_GETX can leave either of them at the front of the queue after the previous WB_Data is processed. If it happens to be the L1_GETX's turn, the request will be served and, since L1cache-1 is in the sharers list, the L1_PUTX won't be discarded but recycled instead.
- (15) Sharers = {L1cache-1} ------> Since the GETX request originated from the replacing cache, L1cache-1 "never leaves" the sharers list, and thus the PUTX will never turn into a PUTX_old. After the Exclusive_Unblock is received, the line is unlocked at the L2 and the PUTX is (incorrectly) served; when the WB_Ack arrives at the L1, the block is in M state, causing the crash.

I think this race could be solved by informing the L2 that it should remove the last owner of the block. I've added a bit "removeLastOwner" that is set to true if the "DataSfromL1" message contains data sent from the TBE. The bit is always copied into the Unblock message, so that when the L2 receives the Unblock from L1cache-0, it can remove L1cache-1 from both sharers and exclusive while still waiting for the WB_Data. In this way, L1cache-0 is the only sharer, but it doesn't have permission to write the block (block in SS at the L2 and S at the exclusive L1). This solved this particular type of race for me; I've sketched the idea in pseudocode below. It could probably also be solved by including a similar "removeWritebacker" bit directly in the WB_Data message.

However, this is not the only kind of race that the randomized tester can extract from this protocol. A similar "invalid transition" crash occurs when a PUTX for a certain block is delayed enough cycles that a second request for the replaced block (served by the replacing cache, because the block was forwarded to another cache and thus invalidated and removed from the TBE in the replacer) manages to complete before the PUTX has arrived at the L2. Here you have the trace: http://skywalker.inf.um.es/~rtitos/files/tester_output

I know that these races are due to the way I insert the L1_Replacement events when flushing the cache, and, as Dan Gibson noted, they cannot occur in the Simics+Ruby in-order model, as long as a second request is not considered until the first one (the one that triggered the replacement) has completed.
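To make the removeLastOwner idea concrete, here is a rough SLICC-flavored sketch. The field, action, and accessor names below are mine (illustrative only, not the actual MESI_CMP_filter_directory identifiers), so read it as pseudocode under those assumptions, not a drop-in patch:

    // Hypothetical new field on the response/unblock message type:
    // set when the DataSfromL1 was supplied from the replacer's TBE.
    structure(ResponseMsg, desc="...", interface="NetworkMessage") {
      // ... existing fields: Address, Type, Sender, Destination, DataBlk ...
      bool removeLastOwner, desc="Data came from TBE; sender is being replaced";
    }

    // L1cache-1, answering the Fwd_GETS out of its TBE during the replacement:
    action(f_sendDataFromTBE, "f", desc="Send DataSfromL1 from TBE to requestor and L2") {
      enqueue(responseNetwork_out, ResponseMsg) {
        out_msg.Address := address;
        out_msg.removeLastOwner := true;  // tell the L2 to drop us as owner/sharer
        // ... Type/Sender/Destination/DataBlk as in the normal $-to-$ path ...
      }
    }

    // L1cache-0 copies the bit verbatim from the DataSfromL1 into its Unblock.

    // L2, on an Unblock with the bit set (while still waiting for the WB_Data):
    // clear the stale owner so a later L1_PUTX from it becomes L1_PUTX_old.
    action(rr_removeLastOwner, "\r", desc="Drop replaced owner from sharers/exclusive") {
      peek(L1unblockNetwork_in, ResponseMsg) {
        if (in_msg.removeLastOwner) {
          // assuming NetDest-style Sharers/Exclusive fields in the L2 entry
          getL2Entry(address).Sharers.removeNetDest(getL2Entry(address).Exclusive);
          getL2Entry(address).Exclusive.clear();
        }
      }
    }

The removeWritebacker variant would carry the same information on the WB_Data itself instead; the point either way is that the L2 learns the writebacker is gone before it unlocks the line and serves the stale PUTX.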
But the way I flush the L1 cache (by issuing replacement events at xact begin, with an artificial "FLUSH" request instead of a "real" data request) could potentially lead to such problems. Of course, the features of Ruby's simple network make it impossible(?) for a later GETX to "surpass" a previous PUTX for the same block (or at least none of my benchmarks has crashed due to this race when running with Simics). However, I think that in a realistic environment, where messages could get delayed by many cycles, it could actually happen.

Cheers,
Ruben

On Jan 17, 2008 11:42 PM, Luke Yen <lyen@xxxxxxxxxxx> wrote:
> Hi Rubin,

--
Ruben Titos
Parallel Computing and Architecture Group (GACOP)
Computer Engineering Dept.
University of Murcia