[Gems-users] LogTM Warning


Date: Fri, 6 Jul 2007 10:57:47 +0200
From: "Rubén Titos" <rtitos@xxxxxxxxxxx>
Subject: [Gems-users] LogTM Warning
Dear list,

I'm running into the same kind of "pil is 0" warnings that others have encountered before me (when using the LogTM protocols), with the difference that my simulations end up aborting because of a "possible deadlock detected" error. I've modified the benchmarks to prevent the OS from shuffling threads by binding each thread to a processor (with processor_bind). I run my simulations with 8 and 16 processors, and in some of the benchmarks I get the error mentioned above, with the following output.
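
For reference, the binding in my benchmarks looks roughly like this (a minimal sketch; bind_self_to_cpu and thread_index are names I made up for illustration, not the actual benchmark code):

/* Sketch of the per-thread binding done in each benchmark thread
 * (Solaris).  thread_index is whatever per-thread id the benchmark
 * assigns at spawn time. */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

void bind_self_to_cpu(int thread_index)
{
    /* P_LWPID + P_MYID binds the calling thread (LWP) to the given
     * processor; a non-NULL last argument would receive the old binding. */
    if (processor_bind(P_LWPID, P_MYID, (processorid_t)thread_index,
                       NULL) != 0) {
        perror("processor_bind");
    }
}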

Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
[...] (more of the same)

Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:105: Possible Deadlock detected
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:106: request is [CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorMode Size=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ]
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:107: m_chip_ptr->getID() is 0
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:108: m_version is 0
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:109: keys.size() is 1
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:110: current_time is 192961881
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:111: request.getTime() is 192911535
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:112: current_time - request.getTime() is 50346
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:113: *m_readRequestTable_ptr is [ [0x6d1cfc0, line 0x6d1cfc0]=[CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorMode Size=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ] ]
Fatal Error: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:114: Aborting
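
As far as I can tell, this check is a simple timeout watchdog over the sequencer's outstanding-request table; a minimal sketch of the idea (all identifiers and the threshold value are assumptions on my part, not the actual Sequencer.C code):

// Illustrative sketch only: the shape of the check behind "Possible
// Deadlock detected".  All identifiers and the threshold value are
// assumptions, not the actual Sequencer.C code.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <map>

using Time = uint64_t;
using Address = uint64_t;

struct Request { Time issue_time; };

const Time kDeadlockThreshold = 50000;  // cycles; assumed value

void checkForDeadlock(const std::map<Address, Request>& readRequestTable,
                      Time current_time)
{
    for (const auto& entry : readRequestTable) {
        // A request pending longer than the threshold (note the ~50k-cycle
        // gap between current_time and request.getTime() in the output
        // above) is treated as a wedged protocol, and the run is aborted.
        if (current_time - entry.second.issue_time >= kDeadlockThreshold) {
            std::fprintf(stderr, "Possible Deadlock detected, addr 0x%llx\n",
                         (unsigned long long)entry.first);
            std::abort();
        }
    }
}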

I also tried running 8-threaded benchmarks on a 16-processor simulated machine, just to leave plenty of CPUs for the OS to run on so that (hopefully) nothing interferes with my threads, but I still got the same problem. Any idea what might be happening? I don't believe the mistake is in the benchmarks, as all of them work fine when I use lock-based synchronization instead of transactions.

Any help is highly appreciated,

Ruben

On 5/06/2007, at 6:56 AM, James Poe wrote:

Oops, as a minor clarification to that example trace I made up: it's actually Processor 0 that is BEING NACKed repeatedly (in my previous post I said that Processor 0 was NACKing some other processor).

James

On 6/4/07, James Poe <gemsmaillist@xxxxxxxxx> wrote:
James,

Thanks for the update.  I'm pretty sure this issue is related to the cause of a problem I posted about previously - many of the LogTM benchmarks failing with the "Simics got unstalled" error even though I have processor binding enabled.  I went through many cases of this error, and what I found is that it occurs following a trail of thousands of NACK messages.  Here is the interesting part - these NACKs are not preceded by a begin transaction for that processor; instead, the last debug message from that processor is a commit followed by the pil warning.

For example, the trace will appear as such:


<some trace data...>

Processor 0 commits transaction 3
pil not set to 0 warning

<period of trace data not including any processor 0 transactions...>

Processor 0 NACKs some other processor
Processor 0 NACKs some other processor
Processor 0 NACKs some other processor
Processor 0 NACKs some other processor
...
"Simics get unstalled without ruby asking?" error


I have also noticed that this error usually occurs within the same transaction (meaning if I run the program multiple times, it's usually the same transaction ID that causes the issue - possibly indicating that something in the source is indeed triggering the interrupt).  I am fairly new to the SPARC architecture, and naive as to how exactly the pil register functions.  Do you know if it acts as some form of interrupt priority level?  For example, if we were to modify Ruby to write 16 into the pil register instead of 15, would that stop the OS from being able to interrupt and cause this issue?

Thanks again for all of your insight.  Anyone else who may have additional thoughts, please feel free to chime in as well ... =)

All the best,

James P



On 6/2/07, James Wang <jameswang99@xxxxxxxxx> wrote:
Hi:
    I did some more work on the following warning:
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
     and found out that these warnings are caused by a trap exception. Currently, the LogTM RegisterStateWindowed object disables interrupts by setting the pil register to 15 at the beginning of a transaction. Then, when the transaction is about to commit, it checks whether the pil register is still 15; if it is not, the warning is printed to the screen. Some exceptions have an interrupt level of 16, which is higher than the pil value of 15 that a transaction sets when it begins, so such an exception is still handled by the OS. When that exception finishes, the OS sets the pil to 0. The transaction then commits, finds that the pil is not 15, and the warning is thrown.
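
In rough code, the sequence that produces the warning is the following (a sketch of the mechanism only; the identifiers are illustrative, not the literal RegisterStateWindowed.C code):

// A sketch of the mechanism described above; all identifiers are
// illustrative, not the literal RegisterStateWindowed.C code.
#include <cstdio>

struct RegisterState {
    int pil;        // simulated copy of the SPARC %pil register
    int saved_pil;  // pre-transaction value, restored at commit
};

const int PIL_MAX = 15;

void disableInterrupts(RegisterState& rs)  // at transaction begin
{
    rs.saved_pil = rs.pil;
    rs.pil = PIL_MAX;  // mask interrupts at levels 1..15
}

void enableInterrupts(RegisterState& rs)   // at transaction commit
{
    if (rs.pil != PIL_MAX) {
        // A trap above level 15 was taken mid-transaction, and its OS
        // handler set pil back to 0 -- exactly the "pil is 0" warning.
        std::fprintf(stderr, "WARNING: in enable interrupts, pil is %d\n",
                     rs.pil);
    }
    rs.pil = rs.saved_pil;  // restore the pre-transaction level
}
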
    So, if you are not sure what happens in your benchmark, I suggest you turn on the exception profiling option in $GEMS/gen-script/microbench.py. This will report what kinds of exceptions were taken while your benchmark was running. You could then check whether any of them has an interrupt level higher than 15, to determine whether you are on the right track.

Regards
James


----- Original Message ----
From: Cong Wang <jameswang99@xxxxxxxxx>
To: Gems Users <gems-users@xxxxxxxxxxx>
Sent: Tuesday, May 22, 2007 7:24:23 AM
Subject: Re: [Gems-users] LogTM Warning

James:

It is my understanding that NACKs in LogTM do not always result in an abort. LogTM aborts the current transaction when it has NACKed an older transaction and is then itself NACKed by an older transaction, which together indicate a possible deadlock situation. And I don't think the pil register has any direct connection with NACKing, and therefore none with atomicity violations. If you want to know more about the pil register setting, please look at $GEMS/ruby/log_tm/RegisterWindow.C. To learn more about the LogTM protocol and how NACKing works, please read $GEMS/protocol/MOESI_SMP_LogTM_directory-cache.sm.
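
In rough code, the rule is something like this (a sketch of the possible-cycle heuristic as I understand it; all names are made up for illustration):

// A sketch of the possible-cycle rule described above; all names are
// illustrative, not the actual protocol code.
#include <cstdint>

struct TransactionState {
    uint64_t timestamp;           // begin time; smaller means older
    bool possible_cycle = false;  // set once we stall an older transaction
};

// Called when this processor NACKs a requester.
void onSendNack(TransactionState& me, uint64_t requester_ts)
{
    if (requester_ts < me.timestamp)
        me.possible_cycle = true;  // we are holding up an older transaction
}

// Called when this processor's own request is NACKed.
// Returns true if the transaction must abort.
bool onReceiveNack(const TransactionState& me, uint64_t nacker_ts)
{
    // NACKed by an older transaction while we are stalling an older one:
    // a dependence cycle (deadlock) is possible, so the younger side
    // aborts rather than risk waiting forever.
    return me.possible_cycle && nacker_ts < me.timestamp;
}
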
       I am not sure exactly what kind of problem you are experiencing. But LogTM should be pretty stable in terms of atomicity. If you could give me more information about the exact atomicity violation, I might be able to suggest a more precise place to look.
Regards
James Wang



On 22/05/2007, at 4:25 AM, James Poe wrote:

Hi James,

Thanks for the response; that does make a lot of sense.  I was wondering what your opinion is on how this changes simulation results.  From my experience, it seems that when the pil register is reset, at least one memory operation is permitted that should otherwise continue to be NACKed (see the trace I posted previously).  Is this true, and if so, is this generally just written off as error?  Also, do you know if this changes the actual functional simulation, since we may be allowing a memory access that violates the atomicity of the transaction?

Your insights are greatly appreciated,

James P


On 5/17/07, James Wang <jameswang99@xxxxxxxxx> wrote:
Hi James:
    From my understanding of the LogTM code, it seems that before a transaction starts, it disables the processor's interrupts by setting the pil register to 15. But sometimes the pil register gets reset by the system, and the warning message you got reflects that.

Regards
James


----- Original Message ----
From: James Poe <gemsmaillist@xxxxxxxxx>
To: gems-users@xxxxxxxxxxx
Sent: Thursday, May 17, 2007 4:04:10 AM
Subject: [Gems-users] LogTM Warning

I am working with the LogTM protocol and I was wondering if anyone could explain to me what this warning means:

Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0

I seem to run into it fairly frequently when I run simulations, and it seems to be followed by an inconsistency in the output trace.  An example occurrence in the debug trace file:

8366153   3  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580, line 0x2c3ec580]  PC [0xff312090, line 0xff312080]  my_ts 7201783  nack_ts 7189475
8366656   2  XACT STORE 2  ADDR [0x1218a9c0, line 0x1218a9c0]  PC [0x228a4, line 0x22880]
8366656   2  XACT COMMIT 2  PC 0x228a8
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
8366666   1  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580, line 0x2c3ec580]  PC [0xff312090, line 0xff312080]  my_ts 7213054  nack_ts 7189475
8381469   3  XACT LOAD 2  ADDR [0x37456080, line 0x37456080]  PC [0xff312094, line 0xff312080]
8381479   3  XACT STORE 2  ADDR [0x2c3ec5bc, line 0x2c3ec580]  PC [0xff3120d4, line 0xff3120c0]


I believe this shows that CPU-3 is NACKed when accessing cache line 0x2c3ec580.  However, the warning message then appears, and the next message we have concerning CPU-3 and T-2 shows it LOADing a DIFFERENT cache line, 0x37456080.  The PC has also been incremented by 4.  To my knowledge, a NACK for an address must be followed by either a LOAD/STORE retry or an ABORT for that address.  After the warning, however, it seems as though that access was just somehow skipped.  I don't notice this happening anywhere else, and I believe I have taken the necessary precautions, such as binding processes.
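
To make that invariant concrete, here is a hypothetical checker one could run over the parsed debug trace (the event names mirror the trace output above; everything else is made up for illustration):

// A hypothetical checker for the invariant described above: once a
// processor is NACKed on an address, its next transactional event must
// be a LOAD/STORE retry of that address, another NACK on it, or an
// ABORT.  Event names mirror the debug trace; the rest is made up.
#include <cstdint>
#include <map>

enum class XactEvent { NACK, LOAD, STORE, ABORT };

// Outstanding NACKed address per processor, filled while scanning.
static std::map<int, uint64_t> pending_nack;

// Returns false when the invariant is violated, e.g. the skipped
// access seen right after the pil warning in the trace above.
bool checkEvent(int proc, XactEvent ev, uint64_t addr)
{
    auto it = pending_nack.find(proc);
    if (it != pending_nack.end()) {
        bool resolves = (ev == XactEvent::ABORT) || (addr == it->second);
        if (!resolves)
            return false;  // the NACKed access was silently skipped
        if (ev != XactEvent::NACK)
            pending_nack.erase(it);  // retry landed or transaction aborted
    }
    if (ev == XactEvent::NACK)
        pending_nack[proc] = addr;
    return true;
}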

Any help would be greatly appreciated,

James