Re: [Gems-users] LogTM Warning


Date: Mon, 09 Jul 2007 11:39:57 -0500
From: Jayaram Bobba <bobba@xxxxxxxxxxx>
Subject: Re: [Gems-users] LogTM Warning
Ruben,

Setting the g_DEADLOCK_THRESHOLD parameter to a higher value might help...
It is possible that the request is getting NACKED for a long duration (in your
case > 50000 cycles). Try

ruby0.setparam g_DEADLOCK_THRESHOLD 20000000

Jayaram


Rubén Titos wrote:
Dear list,

I'm running into the same kind of "pil is 0" warnings than others encountered before me (when using the LogTM protocols) with the difference that my simulations end up aborting because of a "possible deadlock detected". I've modified the benchmarks to prevent the OS from shuffling threads, by binding each thread to a processor (with processor_bind). I run my simulations with 8 and 16 processors, and I get the mentioned error in some of the benchmarks, with the following output.

Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" is WARNING: in enable interrupts Warning: in fn void RegisterState::enableInterrupts(int) in log_tm/RegisterStateWindowed.C:337: pil is 0
[...] (more of the same)

Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:105: Possible Deadlock detected Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:106: request is [CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorMode Size=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ] Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:107: m_chip_ptr->getID() is 0 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:108: m_version is 0 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:109: keys.size() is 1 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:110: current_time is 192961881 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:111: request.getTime() is 192911535 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:112: current_time - request.getTime() is 50346 Warning: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:113: *m_readRequestTable_ptr is [ [0x6d1cfc0, line 0x6d1cfc0]=[CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorMode Size=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ] ] Fatal Error: in fn virtual void Sequencer::wakeup() in system/Sequencer.C:114: Aborting

I also tried to run 8 threaded benchmarks on a 16p simulated machine, just to leave plenty of cpus to the OS to run on, so that (hopefully) nothing is interfering with my threads, but I still got the same problem. Any idea about what might be happening? I believe the mistake is not in the benchmarks, as all of them work well when I use lock-based synchronization instead of transactions.

Any help is highly appreciated,

Ruben

On 5/06/2007, at 6:56 AM, James Poe wrote:

Oops, as a minor clarification in that example trace that I made up, its actually Processor 0 that is BEING NACKed repeated (in my previous post I said that processor 0 was NACKing some other processor).
James

On 6/4/07, *James Poe* < gemsmaillist@xxxxxxxxx <mailto:gemsmaillist@xxxxxxxxx>> wrote:

    James,

    Thanks for the update.  I'm pretty sure this issue is related to
    the cause of a problem I have posted about previously - having
    many of the LogTM benchmarks fail with the "Simics got unstalled"
    error even though I have processor binding enabled.  I went
    through many cases of this error and what I have found is that it
    occurs following a trail of thousands of NACK messages.  Here is
    the interesting part - these NACKS are not preceded by a begin
    transaction for that processor, but instead the last debug
    message from that processor is a commit followed by the pil warning.

    For example, the trace will appear as such:


    <some trace data...>

    Processor 0 commits transaction 3
    pil not set to 0 warning

    <period of trace data not including any processor 0 transactions...>

    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    ...
    "Simics get unstalled without ruby asking?" error


    I have also noticed that this error is usually caused within the
    same transaction (meaning if I run the program multiple times,
    its usually always the same transaction ID that causes the issue
    - possibly indicating that indeed something in the source is
    triggering the interrupt).  I am fairly new to the SPARC
    architecture, and naive as to how exactly the pil register
    functions.  Do you know if it functions as some form of interrupt
    priority level?  For example, if we were to modify Ruby to write
    16 into the pil register instead of 15, would this stop the OS
    from being able to interrupt and cause this issue?

    Thanks again for all of your insight.  Anyone else who may have
    any additional insight, please feel free to chime in as well ... =)

    All the best,

    James P



    On 6/2/07, *James Wang * <jameswang99@xxxxxxxxx
    <mailto:jameswang99@xxxxxxxxx>> wrote:

        Hi :
            I did some more work about the following warning:
        Warning: in fn void RegisterState::enableInterrupts(int) in
        log_tm/RegisterStateWindowed.C:336: "WARNING: in enable
        interrupts" is WARNING: in enable interrupts
        Warning: in fn void RegisterState::enableInterrupts(int) in
        log_tm/RegisterStateWindowed.C:337: pil is 0
             and found out that these warning is caused by a trap
        exception. Currently, the LogTM RegisterStateWindowed object
        will disable the interrupts by setting the pil register to 15
        at the beginning of a transaction. Then, when the transaction
        is about to commit, it checks if the pil register is still 15
        and if it is not 15 the warning is printed onto the screen.
        There are some exceptions that have interrupt level 16 which
        is higher than the pil value 15 a transaction set to when the
        transaction begins, so that exception is handled by the OS.
        When that exception is finished, the OS sets the pil to 0.
        And now the transaction commits and checks the pil is not 15,
        and the warning is thrown.
            So, if you are not sure about what happens in your
        benchmark, I suggest that you turn on the exception profile
        option in $GEMS/gen-script/microbench.py. This will provide
        information about what kind of exception has been taken
        during your benchmark's running time. And you could then try
        to identify whether there is any exception that has interrupt
        level higher than 15 to determine if you are on the right track.

        Regards
        James


        ----- Original Message ----
        From: Cong Wang <jameswang99@xxxxxxxxx
        <mailto:jameswang99@xxxxxxxxx>>
        To: Gems Users < gems-users@xxxxxxxxxxx
        <mailto:gems-users@xxxxxxxxxxx>>
        Sent: Tuesday, May 22, 2007 7:24:23 AM
        Subject: Re: [Gems-users] LogTM Warning

        James:

        It is my understanding that Nacks in LogTM does not always
        result in abort. LogTM aborts if the current transaction is
        NACKed by an older transaction and then the current
        transaction NACKs an older transaction, which means there is
        a possible deadlock situation. And I don't think the pil
        register has any connection with NACKing directly, therefore
        not connected with atomicit violation. If you want to know
        more about pil register setting, please look at the
        $GEMS/ruby/log_tm/RegisterWindow.C. To know more about the
        LogTM protocol and how NACKing works, please read the
        $GEMS/protocol/MOESI_SMP_LogTM_directory- cache.sm
<http://cache.sm/>. I am not sure exactly what kind of problem that you
        are experiencing. But LogTM should be pretty stable in terms
        of atomicity. If you could give me more information about the
        exact atomicity violation, I might be able to suggest more
        precise place to look.
        Regards
        James Wang


        On 22/05/2007, at 4:25 AM, James Poe wrote:

        Hi James,

        Thanks for the response; that does make a lot of sense.  I
        was wondering what your opinion is as to how this changes
        the results of simulations?  It seems from my experience
        that when the pil register is reset, at least one memory
        operation is permitted that should otherwise continue to be
        NACKed (see the trace I posted previously).  Is this true,
and if so, is this generally just written off as error? Also, do you know if this will change the actual functional
        simulation since we may be allowing a memory transaction
        that violates the atomicity of the transaction?

        Your insights are greatly appreciated,

        James P


        On 5/17/07, *James Wang * < jameswang99@xxxxxxxxx
        <mailto:jameswang99@xxxxxxxxx>> wrote:

            Hi James:
                From my understanding of the LogTM code, it seems
            that before a transaction starts it disables the
            processor's interrupt by setting the pil register to 15.
            But sometimes, the pil register gets reset by the
            system. And the warning message you got reflects that.
Regards
            James


            ----- Original Message ----
            From: James Poe < gemsmaillist@xxxxxxxxx
            <mailto:gemsmaillist@xxxxxxxxx>>
            To: gems-users@xxxxxxxxxxx <mailto:gems-users@xxxxxxxxxxx>
            Sent: Thursday, May 17, 2007 4:04:10 AM
            Subject: [Gems-users] LogTM Warning

            I am working with the LogTM protocol and I was wondering
            if anyone could explain to me what this warning means:

            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:336: "WARNING: in
            enable interrupts" is WARNING: in enable interrupts
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:337: pil is 0

            I seem to run into it fairly frequently when I run
            simulations, and it seems to be followed by an
            inconsistency in the output trace.  An example of it
            occurring in the debug trace file shows:

            8366153   3  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580,
line 0x2c3ec580] PC [0xff312090, line 0xff312080] my_ts 7201783 nack_ts 7189475
            8366656   2  XACT STORE 2  ADDR [0x1218a9c0, line
            0x1218a9c0]  PC [0x228a4, line 0x22880]
            8366656   2  XACT COMMIT 2  PC 0x228a8
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:336: "WARNING: in
            enable interrupts" is WARNING: in enable interrupts
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:337: pil is 0
            8366666   1  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580,
line 0x2c3ec580] PC [0xff312090, line 0xff312080] my_ts 7213054 nack_ts 7189475
            8381469   3  XACT LOAD 2  ADDR [0x37456080, line
            0x37456080]  PC [0xff312094, line 0xff312080]
            8381479   3  XACT STORE 2  ADDR [0x2c3ec5bc, line
            0x2c3ec580]  PC [0xff3120d4, line 0xff3120c0]


            I believe this shows that CPU-3 is NACKed when accessing
            cache line 0x2c3ec580.  However, the warning message
            then appears and the next message we have concerning
            CPU-3 and T-2 is that it is LOADing a DIFFERENT cache
            line, 0x37456080.  The PC has also been incremented by
            4.  To my knowledge, a NACK for an address must be
            followed by either a LOAD/STORE or ABORT for that
            address.  After the warning, however, it seems as though
            that access was just somehow skipped.  I don't notice
            this happening anywhere else, and I believe I have taken
            necessary precautions such as binding processes, etc.

            Any help would be greatly appreciated,

            James
            _______________________________________________
            Gems-users mailing list
            Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
            https://lists.cs.wisc.edu/mailman/listinfo/gems-users
            <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
            Use Google to search the GEMS Users mailing list by
            adding "site:
            https://lists.cs.wisc.edu/archive/gems-users/";
            <https://lists.cs.wisc.edu/archive/gems-users/%22>; to
            your search.


            _______________________________________________
            Gems-users mailing list
            Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
            https://lists.cs.wisc.edu/mailman/listinfo/gems-users
            <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
            Use Google to search the GEMS Users mailing list by
            adding "site:
            https://lists.cs.wisc.edu/archive/gems-users/
            <https://lists.cs.wisc.edu/archive/gems-users/>" to your
            search.



        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/"; to
        your search.


        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/";
        <https://lists.cs.wisc.edu/archive/gems-users/%22> ; to your
        search.



        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/
        <https://lists.cs.wisc.edu/archive/gems-users/>" to your search.




_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
https://lists.cs.wisc.edu/mailman/listinfo/gems-users <https://lists.cs.wisc.edu/mailman/listinfo/gems-users> Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/ <https://lists.cs.wisc.edu/archive/gems-users/>" to your search.



------------------------------------------------------------------------

_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/"; to your search.

[← Prev in Thread] Current Thread [Next in Thread→]