Re: [Gems-users] LogTM Warning

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date:	Mon, 09 Jul 2007 11:39:57 -0500
From:	Jayaram Bobba <bobba@xxxxxxxxxxx>
Subject:	Re: [Gems-users] LogTM Warning

Ruben,

Setting the g_DEADLOCK_THRESHOLD parameter to a higher value might help...

It is possible that the request is getting NACKED for a long duration(in your

case > 50000 cycles). Try

ruby0.setparam g_DEADLOCK_THRESHOLD 20000000

Jayaram


Rubén Titos wrote:

Dear list,

I'm running into the same kind of "pil is 0" warnings than othersencountered before me (when using the LogTM protocols) with thedifference that my simulations end up aborting because of a "possibledeadlock detected". I've modified the benchmarks to prevent the OSfrom shuffling threads, by binding each thread to a processor (withprocessor_bind). I run my simulations with 8 and 16 processors, and Iget the mentioned error in some of the benchmarks, with the followingoutput.

Warning: in fn void RegisterState::enableInterrupts(int) inlog_tm/RegisterStateWindowed.C:336: "WARNING: in enable interrupts" isWARNING: in enable interruptsWarning: in fn void RegisterState::enableInterrupts(int) inlog_tm/RegisterStateWindowed.C:337: pil is 0

[...] (more of the same)

Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:105: Possible Deadlock detectedWarning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:106: request is [CacheMsg: Address=[0x6d1cff0, line0x6d1cfc0] Type=LD ProgramCounter=[0x1001b00, line 0x1001b00]AccessMode=SupervisorMode Size=8 Prefetch=No Version=0 Aborted=0Time=192911535 ]Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:107: m_chip_ptr->getID() is 0Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:108: m_version is 0Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:109: keys.size() is 1Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:110: current_time is 192961881Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:111: request.getTime() is 192911535Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:112: current_time - request.getTime() is 50346Warning: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:113: *m_readRequestTable_ptr is [ [0x6d1cfc0, line0x6d1cfc0]=[CacheMsg: Address=[0x6d1cff0, line 0x6d1cfc0] Type=LDProgramCounter=[0x1001b00, line 0x1001b00] AccessMode=SupervisorModeSize=8 Prefetch=No Version=0 Aborted=0 Time=192911535 ] ]Fatal Error: in fn virtual void Sequencer::wakeup() insystem/Sequencer.C:114: Aborting

I also tried to run 8 threaded benchmarks on a 16p simulated machine,just to leave plenty of cpus to the OS to run on, so that (hopefully)nothing is interfering with my threads, but I still got the sameproblem. Any idea about what might be happening? I believe the mistakeis not in the benchmarks, as all of them work well when I uselock-based synchronization instead of transactions.


Any help is highly appreciated,

Ruben

On 5/06/2007, at 6:56 AM, James Poe wrote:

Oops, as a minor clarification in that example trace that I made up,its actually Processor 0 that is BEING NACKed repeated (in myprevious post I said that processor 0 was NACKing some otherprocessor).

James

On 6/4/07, *James Poe* < gemsmaillist@xxxxxxxxx<mailto:gemsmaillist@xxxxxxxxx>> wrote:


    James,

    Thanks for the update.  I'm pretty sure this issue is related to
    the cause of a problem I have posted about previously - having
    many of the LogTM benchmarks fail with the "Simics got unstalled"
    error even though I have processor binding enabled.  I went
    through many cases of this error and what I have found is that it
    occurs following a trail of thousands of NACK messages.  Here is
    the interesting part - these NACKS are not preceded by a begin
    transaction for that processor, but instead the last debug
    message from that processor is a commit followed by the pil warning.

    For example, the trace will appear as such:


    <some trace data...>

    Processor 0 commits transaction 3
    pil not set to 0 warning

    <period of trace data not including any processor 0 transactions...>

    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    Processor 0 NACKs some other processor
    ...
    "Simics get unstalled without ruby asking?" error


    I have also noticed that this error is usually caused within the
    same transaction (meaning if I run the program multiple times,
    its usually always the same transaction ID that causes the issue
    - possibly indicating that indeed something in the source is
    triggering the interrupt).  I am fairly new to the SPARC
    architecture, and naive as to how exactly the pil register
    functions.  Do you know if it functions as some form of interrupt
    priority level?  For example, if we were to modify Ruby to write
    16 into the pil register instead of 15, would this stop the OS
    from being able to interrupt and cause this issue?

    Thanks again for all of your insight.  Anyone else who may have
    any additional insight, please feel free to chime in as well ... =)

    All the best,

    James P



    On 6/2/07, *James Wang * <jameswang99@xxxxxxxxx
    <mailto:jameswang99@xxxxxxxxx>> wrote:

        Hi :
            I did some more work about the following warning:
        Warning: in fn void RegisterState::enableInterrupts(int) in
        log_tm/RegisterStateWindowed.C:336: "WARNING: in enable
        interrupts" is WARNING: in enable interrupts
        Warning: in fn void RegisterState::enableInterrupts(int) in
        log_tm/RegisterStateWindowed.C:337: pil is 0
             and found out that these warning is caused by a trap
        exception. Currently, the LogTM RegisterStateWindowed object
        will disable the interrupts by setting the pil register to 15
        at the beginning of a transaction. Then, when the transaction
        is about to commit, it checks if the pil register is still 15
        and if it is not 15 the warning is printed onto the screen.
        There are some exceptions that have interrupt level 16 which
        is higher than the pil value 15 a transaction set to when the
        transaction begins, so that exception is handled by the OS.
        When that exception is finished, the OS sets the pil to 0.
        And now the transaction commits and checks the pil is not 15,
        and the warning is thrown.
            So, if you are not sure about what happens in your
        benchmark, I suggest that you turn on the exception profile
        option in $GEMS/gen-script/microbench.py. This will provide
        information about what kind of exception has been taken
        during your benchmark's running time. And you could then try
        to identify whether there is any exception that has interrupt
        level higher than 15 to determine if you are on the right track.

        Regards
        James


        ----- Original Message ----
        From: Cong Wang <jameswang99@xxxxxxxxx
        <mailto:jameswang99@xxxxxxxxx>>
        To: Gems Users < gems-users@xxxxxxxxxxx
        <mailto:gems-users@xxxxxxxxxxx>>
        Sent: Tuesday, May 22, 2007 7:24:23 AM
        Subject: Re: [Gems-users] LogTM Warning

        James:

        It is my understanding that Nacks in LogTM does not always
        result in abort. LogTM aborts if the current transaction is
        NACKed by an older transaction and then the current
        transaction NACKs an older transaction, which means there is
        a possible deadlock situation. And I don't think the pil
        register has any connection with NACKing directly, therefore
        not connected with atomicit violation. If you want to know
        more about pil register setting, please look at the
        $GEMS/ruby/log_tm/RegisterWindow.C. To know more about the
        LogTM protocol and how NACKing works, please read the
        $GEMS/protocol/MOESI_SMP_LogTM_directory- cache.sm

<http://cache.sm/>.I am not sure exactly what kind of problem that you

        are experiencing. But LogTM should be pretty stable in terms
        of atomicity. If you could give me more information about the
        exact atomicity violation, I might be able to suggest more
        precise place to look.
        Regards
        James Wang


        On 22/05/2007, at 4:25 AM, James Poe wrote:

        Hi James,

        Thanks for the response; that does make a lot of sense.  I
        was wondering what your opinion is as to how this changes
        the results of simulations?  It seems from my experience
        that when the pil register is reset, at least one memory
        operation is permitted that should otherwise continue to be
        NACKed (see the trace I posted previously).  Is this true,

and if so, is this generally just written off as error?Also, do you know if this will change the actual functional

        simulation since we may be allowing a memory transaction
        that violates the atomicity of the transaction?

        Your insights are greatly appreciated,

        James P


        On 5/17/07, *James Wang * < jameswang99@xxxxxxxxx
        <mailto:jameswang99@xxxxxxxxx>> wrote:

            Hi James:
                From my understanding of the LogTM code, it seems
            that before a transaction starts it disables the
            processor's interrupt by setting the pil register to 15.
            But sometimes, the pil register gets reset by the
            system. And the warning message you got reflects that.

Regards

            James


            ----- Original Message ----
            From: James Poe < gemsmaillist@xxxxxxxxx
            <mailto:gemsmaillist@xxxxxxxxx>>
            To: gems-users@xxxxxxxxxxx <mailto:gems-users@xxxxxxxxxxx>
            Sent: Thursday, May 17, 2007 4:04:10 AM
            Subject: [Gems-users] LogTM Warning

            I am working with the LogTM protocol and I was wondering
            if anyone could explain to me what this warning means:

            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:336: "WARNING: in
            enable interrupts" is WARNING: in enable interrupts
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:337: pil is 0

            I seem to run into it fairly frequently when I run
            simulations, and it seems to be followed by an
            inconsistency in the output trace.  An example of it
            occurring in the debug trace file shows:

            8366153   3  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580,

line 0x2c3ec580] PC [0xff312090, line 0xff312080]my_ts 7201783 nack_ts 7189475

            8366656   2  XACT STORE 2  ADDR [0x1218a9c0, line
            0x1218a9c0]  PC [0x228a4, line 0x22880]
            8366656   2  XACT COMMIT 2  PC 0x228a8
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:336: "WARNING: in
            enable interrupts" is WARNING: in enable interrupts
            Warning: in fn void RegisterState::enableInterrupts(int)
            in log_tm/RegisterStateWindowed.C:337: pil is 0
            8366666   1  XACT NACK 2 by proc: 2  ADDR [0x2c3ec580,

line 0x2c3ec580] PC [0xff312090, line 0xff312080]my_ts 7213054 nack_ts 7189475

            8381469   3  XACT LOAD 2  ADDR [0x37456080, line
            0x37456080]  PC [0xff312094, line 0xff312080]
            8381479   3  XACT STORE 2  ADDR [0x2c3ec5bc, line
            0x2c3ec580]  PC [0xff3120d4, line 0xff3120c0]


            I believe this shows that CPU-3 is NACKed when accessing
            cache line 0x2c3ec580.  However, the warning message
            then appears and the next message we have concerning
            CPU-3 and T-2 is that it is LOADing a DIFFERENT cache
            line, 0x37456080.  The PC has also been incremented by
            4.  To my knowledge, a NACK for an address must be
            followed by either a LOAD/STORE or ABORT for that
            address.  After the warning, however, it seems as though
            that access was just somehow skipped.  I don't notice
            this happening anywhere else, and I believe I have taken
            necessary precautions such as binding processes, etc.

            Any help would be greatly appreciated,

            James
            _______________________________________________
            Gems-users mailing list
            Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
            https://lists.cs.wisc.edu/mailman/listinfo/gems-users
            <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
            Use Google to search the GEMS Users mailing list by
            adding "site:
            https://lists.cs.wisc.edu/archive/gems-users/";
            <https://lists.cs.wisc.edu/archive/gems-users/%22>; to
            your search.


            _______________________________________________
            Gems-users mailing list
            Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
            https://lists.cs.wisc.edu/mailman/listinfo/gems-users
            <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
            Use Google to search the GEMS Users mailing list by
            adding "site:
            https://lists.cs.wisc.edu/archive/gems-users/
            <https://lists.cs.wisc.edu/archive/gems-users/>" to your
            search.



        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/"; to
        your search.


        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        <https://lists.cs.wisc.edu/mailman/listinfo/gems-users>
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/";
        <https://lists.cs.wisc.edu/archive/gems-users/%22> ; to your
        search.



        _______________________________________________
        Gems-users mailing list
        Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>
        https://lists.cs.wisc.edu/mailman/listinfo/gems-users
        Use Google to search the GEMS Users mailing list by adding
        "site: https://lists.cs.wisc.edu/archive/gems-users/
        <https://lists.cs.wisc.edu/archive/gems-users/>" to your search.




_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx <mailto:Gems-users@xxxxxxxxxxx>

https://lists.cs.wisc.edu/mailman/listinfo/gems-users<https://lists.cs.wisc.edu/mailman/listinfo/gems-users>Use Google to search the GEMS Users mailing list by adding"site:https://lists.cs.wisc.edu/archive/gems-users/<https://lists.cs.wisc.edu/archive/gems-users/>" to your search.



------------------------------------------------------------------------

_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/"; to your search.

[← Prev in Thread]	Current Thread	[Next in Thread→]
[Gems-users] LogTM Warning, Rubén Titos Re: [Gems-users] LogTM Warning, Jayaram Bobba <= Re: [Gems-users] LogTM Warning, Rubén Titos

Previous by Date:	Re: [Gems-users] Ruby in a cluster, Dan Gibson
Next by Date:	[Gems-users] Running SPLASH2 benchmarks with LogTM, cge
Previous by Thread:	[Gems-users] LogTM Warning, Rubén Titos
Next by Thread:	Re: [Gems-users] LogTM Warning, Rubén Titos
Indexes:	[Date] [Thread]

Mailing List Archives

Authenticated access

Re: [Gems-users] LogTM Warning