Dear list,
I'm running into the same kind of "pil is 0" warnings
than others encountered before me (when using the LogTM protocols) with the
difference that my simulations end up aborting because of a "possible
deadlock detected". I've modified the benchmarks to prevent the OS
from shuffling threads, by binding each
thread to a processor (with processor_bind). I run my simulations with 8 and 16 processors, and I get the mentioned error in some of the benchmarks, with the following output.
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts
Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
[...] (more of the same)
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:105: Possible Deadlock detected Warning:
in fn virtual void Sequencer::wakeup() in system/Sequencer.C:106:
request is [CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD
ProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorMode
Size=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ]
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:107: m_chip_ptr->getID() is 0 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:108: m_version is 0 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:109:
keys.size() is 1 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:110: current_time is 192961881 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:111: request.getTime() is 192911535
Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:112: current_time - request.getTime() is 50346 Warning:
in fn virtual void Sequencer::wakeup() in system/Sequencer.C:113:
*m_readRequestTable_ptr is [ [0x6d1cfc0, line 0x6d1cfc0]=[CacheMsg:
Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00,
line 0x1001b00] AccessMode=SupervisorMode Size=8 Prefetch=No Version=0
Aborted=0 Time=192911535 ] ]
Fatal Error: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:114: Aborting
I also tried to run 8 threaded benchmarks on a 16p simulated machine,
just to leave plenty of cpus to the OS to run on, so that (hopefully) nothing is interfering with my threads, but I still got the same problem. Any idea about what might be happening? I believe the mistake is not in the benchmarks, as all of them work well when I use lock-based synchronization instead of transactions.
Any help is highly appreciated,
Ruben
On 5/06/2007, at 6:56 AM, James Poe wrote:
Oops,
as a minor clarification in that example trace that I made up, its
actually Processor 0 that is BEING NACKed repeated (in my previous post
I said that processor 0 was NACKing some other processor).
James
On 6/4/07, James Poe <
gemsmaillist@xxxxxxxxx> wrote: James,
Thanks
for the update. I'm pretty sure this issue is related to the cause of
a problem I have posted about previously - having many of the LogTM
benchmarks fail with the "Simics got unstalled" error even though I
have processor binding enabled. I went through many cases of this
error and what I have found is that it occurs following a trail of
thousands of NACK messages. Here is the interesting part - these NACKS
are not preceded by a begin transaction for that processor, but instead
the last debug message from that processor is a commit followed by the
pil warning.
For example, the trace will appear as such:
<some trace data...>
Processor 0 commits transaction 3 pil not set to 0 warning
<period of trace data not including any processor 0 transactions...>
Processor 0 NACKs some other processor Processor 0 NACKs some other processor Processor 0 NACKs some other processor Processor 0 NACKs some other processor ... "Simics get unstalled without ruby asking?" error
I
have also noticed that this error is usually caused within the same
transaction (meaning if I run the program multiple times, its usually
always the same transaction ID that causes the issue - possibly
indicating that indeed something in the source is triggering the
interrupt). I am fairly new to the SPARC architecture, and naive as to
how exactly the pil register functions. Do you know if it functions as
some form of interrupt priority level? For example, if we were to
modify Ruby to write 16 into the pil register instead of 15, would this
stop the OS from being able to interrupt and cause this issue?
Thanks
again for all of your insight. Anyone else who may have any additional
insight, please feel free to chime in as well ... =)
All the best,
James P
On 6/2/07, James Wang
<jameswang99@xxxxxxxxx> wrote:
Hi : I did some more work about the following warning: Warning:
in fn void RegisterState::enableInterrupts(int) in
log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is
WARNING: in enable interrupts Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0 and found out that these warning is caused by a trap exception. Currently, the LogTM
RegisterStateWindowed
object will disable the interrupts by setting the pil register to 15 at
the beginning of a transaction. Then, when the transaction is about to
commit, it checks if the pil register is still 15 and if it is not 15
the warning is printed onto the screen. There are some exceptions that
have interrupt level 16 which is higher than the pil value 15 a
transaction set to when the transaction begins, so that exception is
handled by the OS. When that exception is finished, the OS sets the pil
to 0. And now the transaction commits and checks the pil is not 15, and
the warning is thrown. So, if you are not sure about what
happens in your benchmark, I suggest that you turn on the exception
profile option in $GEMS/gen-script/microbench.py. This will provide
information about what kind of exception has been taken during your
benchmark's running time. And you could then try to identify whether
there is any exception that has interrupt level higher than 15 to
determine if you are on the right track.
Regards James
----- Original Message ---- From: Cong Wang < jameswang99@xxxxxxxxx> To: Gems Users <
gems-users@xxxxxxxxxxx> Sent: Tuesday, May 22, 2007 7:24:23 AM Subject: Re: [Gems-users] LogTM Warning James:
It
is my understanding that Nacks in LogTM does not always result in
abort. LogTM aborts if the current transaction is NACKed by an older
transaction and then the current transaction NACKs an older
transaction, which means there is a possible deadlock situation. And I
don't think the pil register has any connection with NACKing directly,
therefore not connected with atomicit violation. If you want to know
more about pil register setting, please look at the
$GEMS/ruby/log_tm/RegisterWindow.C. To know more about the LogTM
protocol and how NACKing works, please read the
$GEMS/protocol/MOESI_SMP_LogTM_directory- cache.sm.
I
am not sure exactly what kind of problem that you are experiencing. But
LogTM should be pretty stable in terms of atomicity. If you could give
me more information about the exact atomicity violation, I might be
able to suggest more precise place to look. Regards James Wang
On 22/05/2007, at 4:25 AM, James Poe wrote:
Hi James,
Thanks
for the response; that does make a lot of sense. I was wondering what
your opinion is as to how this changes the results of simulations? It
seems from my experience that when the pil register is reset, at least
one memory operation is permitted that should otherwise continue to be
NACKed (see the trace I posted previously). Is this true, and if so,
is this generally just written off as error? Also, do you know if this
will change the actual functional simulation since we may be allowing a
memory transaction that violates the atomicity of the transaction?
Your insights are greatly appreciated,
James P
On 5/17/07, James Wang
< jameswang99@xxxxxxxxx > wrote:
Hi James: From
my understanding of the LogTM code, it seems that before a transaction
starts it disables the processor's interrupt by setting the pil
register to 15. But sometimes, the pil register gets reset by the
system. And the warning message you got reflects that. Regards James
----- Original Message ---- From: James Poe < gemsmaillist@xxxxxxxxx>
To:
gems-users@xxxxxxxxxxx Sent: Thursday, May 17, 2007 4:04:10 AM Subject: [Gems-users] LogTM Warning
I am working with the LogTM protocol and I was wondering if anyone could explain to me what this warning means:
Warning:
in fn void RegisterState::enableInterrupts(int) in
log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is
WARNING: in enable interrupts Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
I
seem to run into it fairly frequently when I run simulations, and it
seems to be followed by an inconsistency in the output trace. An
example of it occurring in the debug trace file shows:
8366153
3 XACT NACK 2 by proc: 2 ADDR [0x2c3ec580, line 0x2c3ec580] PC
[0xff312090, line 0xff312080] my_ts 7201783 nack_ts 7189475 8366656 2 XACT STORE 2 ADDR [0x1218a9c0, line 0x1218a9c0] PC [0x228a4, line 0x22880] 8366656 2 XACT COMMIT 2 PC 0x228a8 Warning:
in fn void RegisterState::enableInterrupts(int) in
log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is
WARNING: in enable interrupts Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0 8366666
1 XACT NACK 2 by proc: 2 ADDR [0x2c3ec580, line 0x2c3ec580] PC
[0xff312090, line 0xff312080] my_ts 7213054 nack_ts 7189475 8381469 3 XACT LOAD 2 ADDR [0x37456080, line 0x37456080] PC [0xff312094, line 0xff312080] 8381479 3 XACT STORE 2 ADDR [0x2c3ec5bc, line 0x2c3ec580] PC [0xff3120d4, line 0xff3120c0]
I
believe this shows that CPU-3 is NACKed when accessing cache line
0x2c3ec580. However, the warning message then appears and the next
message we have concerning CPU-3 and T-2 is that it is LOADing a
DIFFERENT cache line, 0x37456080. The PC has also been incremented by
4. To my knowledge, a NACK for an address must be followed by either a
LOAD/STORE or ABORT for that address. After the warning, however, it
seems as though that access was just somehow skipped. I don't notice
this happening anywhere else, and I believe I have taken necessary
precautions such as binding processes, etc.
Any help would be greatly appreciated,
James
_______________________________________________
Gems-users mailing list Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users Use Google to search the GEMS Users mailing list by adding "site:
https://lists.cs.wisc.edu/archive/gems-users/
" to your search.
_______________________________________________ Gems-users mailing list
_______________________________________________ Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:
https://lists.cs.wisc.edu/archive/gems-users/ " to your search.
_______________________________________________ Gems-users mailing list
|