Re: [Gems-users] Ruby Segmentation Fault


Date: Mon, 09 Feb 2009 21:52:00 +0200
From: Konstantinos Nikas <knikas@xxxxxxxxxxxxxxxxx>
Subject: Re: [Gems-users] Ruby Segmentation Fault
Hi Jayaram,

I have sent an email with the output file, but it is probably too big to go through and is awaiting a moderator's confirmation. I could upload it somewhere else if you like, to avoid sending it to everyone on the list.

In the meantime, we are trying to find the problem ourselves and are stuck on the following case. Thread 0 on proc 1 starts a CLOSED transaction, and the output is:

41181831 1 [1,0] ADD XACT FRAME oldLogFramePointer: [0x2d9020, line 0x2d9000] newLogFramePointer: [0x2d9020, line 0x2d9000] 1
41181831 1 [1,0] BEGIN XACT: TID 0 XID 10 XACT_LEVEL: 1 PC: [0x137fc, line 0x137c0]

If we understand Ruby's code correctly, at this point the TransactionVersionManager will call beginTransaction, which will call takeCheckpoint, which will execute

m_registers[thread=0][transactionLevel-1 = 0]->takeCheckpoint()

Later this thread decides to abort :

41183071 1 [1,0] SETTING ABORT FLAG ADDR = [0x38002218, line 0x38002200] PC = [0x13880, line 0x13880] NPC = [0x13884, line 0x13880]
41183074 1 [1,0] ISOLATE XACT STORE [0x3b7e3740, line 0x3b7e3740] XACT LEVEL: 1 PC = [0x13880, line 0x13880]
41183077 2 [2,0] ISOLATE XACT STORE [0x38002200, line 0x38002200] XACT LEVEL: 1 PC = [0x12dcc, line 0x12dc0]
41183077 2 [2,0] LOGGING STORE: [0x2ae200, line 0x2ae200] 1 PC = [0x12dcc, line 0x12dc0]
**** Log. proc. num: 2:  m_logSize: 1632 m_maxLogSize: 781
41183077 2 [2,0] ADD UNDO LOG ENTRY: [0x2ae200, line 0x2ae200] [0x38002200, line 0x38002200] LogAddress: [0x3a163c, line 0x3a1600] 1
41183082 2 [2,0] ISOLATE XACT STORE [0x38002200, line 0x38002200] XACT LEVEL: 1 PC = [0x12dd0, line 0x12dc0]
41183082 2 [2,0] LOGGING STORE: [0x2ae200, line 0x2ae200] 0 PC = [0x12dd0, line 0x12dc0]
41183091 2 [2,0] ISOLATE XACT LOAD VA: [0xfeffbec0, line 0xfeffbec0] PA: [0x3c543ec0, line 0x3c543ec0] XACT LEVEL: 1 PC = [0x13244, line 0x13240]
41183091 1 [1,0] TRAP TO HANDLER: TID: 0 TRAP_TYPE 1 TRAP ADDRESS 0x38002218 NUM_RETRIES 0 LOG_SIZE 1360 XACT_LEVEL 1 XACT_LOWEST_CONFLICT_LEVEL 1 Handler Address = [0x1b39c, line 0x1b380] PC = [0x100707c, line 0x1007040]
41183091 1 [1,0] Begin ESCAPE ACTION - ESCAPE DEPTH: 1 PC [0x100707c, line 0x1007040]
Begin exposed action for thread 0 of proc 1 PC [0x1b39c, line 0x1b380]
41183092 1 [1,0] Begin ESCAPE ACTION - ESCAPE DEPTH: 2 PC [0x1b39c, line 0x1b380]

The abort handler then releases isolation accordingly and restarts the transaction:

End exposed action for thread 0 of proc 1 PC [0x1b3dc, line 0x1b3c0]
41194048 1 [1,0] END ESCAPE ACTION - ESCAPE DEPTH: 1 PC [0x1b3dc, line 0x1b3c0]
Restart transaction for thread 0 of proc 1
restartTransactionCallback proc = 1 thread = 0 time = 41194049
41194049 1 [1,0] END ESCAPE ACTION - ESCAPE DEPTH: 0 PC [0x1b3e4, line 0x1b3c0]
1 [1,0] TID 0 RESTART TRANSACTION AT XACT LEVEL: 1 LOG_SIZE: 1360
Segmentation fault (SIGSEGV) in main thread

So, according to the debug output, thread 0 will restart its transaction and the new xact level is 1. TransactionInterfaceManager::restartTransactionCallback therefore executes:

getXactVersionManager()->restartTransaction(thread = 0, new_xact_level=1)

which in turn calls:

m_registers[0][1]->restoreCheckpoint()

which causes the SEG FAULT, because the original transaction took the checkpoint for m_registers[0][0]!
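
The suspected mismatch can be reproduced in a toy model. This is only a sketch of the hypothesis above, not the GEMS code: `ToyRegisterState`, `ToyVersionManager`, and the `slot_index` parameter are hypothetical names, and the only thing borrowed from Ruby is the indexing described in this message (checkpoint stored at `transactionLevel-1`, restore attempted at `xact_level`). The restore path returns false where the real code would dereference a null pointer.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for Ruby's RegisterState; not the GEMS class.
struct ToyRegisterState {
    bool has_checkpoint = false;
    void takeCheckpoint() { has_checkpoint = true; }
    void restoreCheckpoint() { has_checkpoint = false; }
};

// Toy model of a per-thread, per-nesting-level checkpoint table.
class ToyVersionManager {
public:
    ToyVersionManager(std::size_t threads, std::size_t max_levels)
        : m_registers(threads) {
        // Every slot starts out empty (null) until a checkpoint is taken.
        for (auto& levels : m_registers) levels.resize(max_levels);
    }

    // Indexing as described in the message: level N is stored in slot N-1.
    void beginTransaction(std::size_t thread, std::size_t xact_level) {
        auto& slot = m_registers[thread][xact_level - 1];
        if (!slot) slot = std::make_unique<ToyRegisterState>();
        slot->takeCheckpoint();
    }

    // Returns false instead of dereferencing a slot that was never filled.
    bool restartTransaction(std::size_t thread, std::size_t slot_index) {
        if (thread >= m_registers.size()) return false;
        auto& levels = m_registers[thread];
        if (slot_index >= levels.size() || !levels[slot_index]) return false;
        levels[slot_index]->restoreCheckpoint();
        return true;
    }

private:
    std::vector<std::vector<std::unique_ptr<ToyRegisterState>>> m_registers;
};
```

In this model, after `beginTransaction(0, 1)` a call to `restartTransaction(0, 1)` finds an empty slot and returns false, while `restartTransaction(0, 0)` succeeds. Whether Ruby really passes the level unadjusted at this point is exactly the open question.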

It seems too elementary to be a real bug, so I guess we are missing something in the code.

Kind regards,

Kostis

The segmentation fault seems to occur because Ruby does not find the register
checkpoint for the processor that is trying to restart its transaction...

#0  RegisterState::restoreCheckpoint (this=0x0, m_proc=1) at
    /home/users/anastop/gems/gems-2.1//common/Vector.h:92
#1  0x00002aaab066bc5d in
    TransactionVersionManager::restartTransaction
    (this=0xa341340, thread=0, xact_level=1) at

Can you get more debug output by setting XACT_DEBUG and XACT_DEBUG_LEVEL?


Jayaram


Konstantinos Nikas wrote:
The code we are running is a transactional workload that we have developed and we set it up according to the directions provided in the wiki (bind threads, call set_transaction_registers, etc).

The protocol is MESI_CMP_filter_directory as it is the only one LogTM can use (at least in the latest version of GEMS).

Kind regards,

Kostis
What benchmark are you running and what protocol?

Polina

On Thu, Feb 5, 2009 at 12:47 PM, Konstantinos Nikas <knikas@xxxxxxxxxxxxxxxxx> wrote:

    Hi all,

    we have an 8-core CMP and a transactional workload which only uses 2
    threads. We bind the 2 threads to 2 specific processors (always
    avoiding core 0). When we set XACT_LOG_BUFFER_SIZE=2048 everything
    works fine. For smaller values (0, 256, 1024), though, the simulation
    fails.

    At first we used to get the following warning messages :

    45936462 2 [2,0] endEscapeAction WARNING escape depth < 1. Depth = 0
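
The warning text reads as an endEscapeAction arriving while the escape depth is already 0, that is, an end without a matching begin. A minimal sketch of that bookkeeping follows, assuming a plain nesting counter; `ToyEscapeTracker` and its members are hypothetical names and this is not how Ruby necessarily implements it.

```cpp
#include <iostream>

// Hypothetical escape-depth tracker; illustrative only, not taken from GEMS.
class ToyEscapeTracker {
public:
    void beginEscapeAction() { ++m_depth; }

    // Returns false (and warns) when there is no open escape action to close.
    bool endEscapeAction() {
        if (m_depth < 1) {
            std::cerr << "endEscapeAction WARNING escape depth < 1. Depth = "
                      << m_depth << '\n';
            return false;
        }
        --m_depth;
        return true;
    }

    int depth() const { return m_depth; }

private:
    int m_depth = 0;  // number of currently open escape actions
};
```

Under this model, any code path that reaches the end call without first passing through a begin call triggers the warning, and an extra begin on that path keeps the counter balanced.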

    Searching the mailing list we came across a post which suggested adding
    a beginEscapeAction() call into hardwareAbort(). We included this in our
    code and the warning messages went away. However, the simulations still
    fail with a segmentation fault. Gdb reported the following :

    #0  RegisterState::restoreCheckpoint (this=0x0, m_proc=1) at
    /home/users/anastop/gems/gems-2.1//common/Vector.h:92
    #1  0x00002aaab066bc5d in
    TransactionVersionManager::restartTransaction
    (this=0xa341340, thread=0, xact_level=1) at
    /home/users/anastop/gems/gems-2.1//common/Vector.h:109
    #2  0x00002aaab0656b89 in
    TransactionInterfaceManager::restartTransactionCallback
    (this=0xa341230,
    thread=0) at log_tm/TransactionInterfaceManager.C:751
    #3  0x00002aaaad20fb70 in ?? () from
    /home/simics/academic/simics-3.0.31/amd64-linux/lib/sparc-u3.so
    #4  0x00002aaaad1aed99 in ?? () from
    /home/simics/academic/simics-3.0.31/amd64-linux/lib/sparc-u3.so
    #5  0x00002aaaad1aec9a in ?? () from
    /home/simics/academic/simics-3.0.31/amd64-linux/lib/sparc-u3.so
    #6  0x00002b1b49bc2eaf in SIM_continue () from
    /home/simics/academic/simics-3.0.31/amd64-linux/bin/libsimics-common.so
    #7  0x00002b1b49b83a9c in ?? () from
    /home/simics/academic/simics-3.0.31/amd64-linux/bin/libsimics-common.so
    #8  0x00002b1b4aaf739c in PyCFunction_Call (func=0x2aaaaab26560,
    arg=0x2aaaac9f6a50, kw=0x0) at /home/packages/python-2.4.2 .......

    Any ideas? Or suggestions on how to debug more efficiently?

    Kind regards,

    Kostis

    PS: A similar situation occurs when we run the same 2 threads on a
    4-core machine. It works fine for XACT_LOG_BUFFER_SIZE=0,256,1024,2048
    and fails for size=32!

    _______________________________________________
    Gems-users mailing list
    Gems-users@xxxxxxxxxxx
    https://lists.cs.wisc.edu/mailman/listinfo/gems-users
    Use Google to search the GEMS Users mailing list by adding
    "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.

