Re: [DynInst_API:] Hung process


Date: Wed, 18 Feb 2015 13:36:06 -0800 (PST)
From: Matthew LeGendre <legendre1@xxxxxxxx>
Subject: Re: [DynInst_API:] Hung process

On Wed, 18 Feb 2015, Matthew LeGendre wrote:
On Wed, 18 Feb 2015, Bill Williams wrote:
On 02/18/2015 01:54 PM, Josh Stone wrote:
On 02/18/2015 11:42 AM, Bill Williams wrote:
On 02/18/2015 01:37 PM, Gerard wrote:
Ah ok, I didn't know that.

As for how reproducible the error is, I ran it three times (without the
change you suggested) and every time it stopped at around 32000 threads.
Now I added appProc->continueExecution() and it happened again after
creating 32322 threads, so it seems this is not the problem.

Then it's got to be that somewhere in here, we're messing up internal
stop/continue state without that propagating out to the user level.
Debug logs will tell me something eventually...sadly, they're verbose
and time-consuming.

Which kernel version/distribution are you using, by the way?

TIDs usually wrap at 2^15, so they'll be reused in this test.
Perhaps this is confusing dyninst somewhere?

That's certainly a possibility; we're hitting our starvation case (a theoretically running process generates no ptrace events) when the TID and PID once again are the same, which I would expect guarantees that we've recycled an LWPID.

I've attached the tag end of a log that should reflect Gerard's problem; there's postponed syscall handling going on, but by initial mark 1 eyeball nothing's obviously broken (aside from the results)...

Bill,

The problem here seems to be a missing event we expect from Linux. Ptrace should give us two events upon clone(), one from the parent thread and one from the new child thread. When ProcControlAPI sees an event from one thread it holds that thread until it sees both events. In this case we see the parent event, but never see the event from the new child.
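
To make the pairing concrete, here is a rough sketch of the "hold until both events arrive" bookkeeping. It is only an illustration: ClonePairTracker, CloneSide, observe() and is_waiting() are made-up names, not ProcControlAPI's actual classes, and I'm assuming the two halves are matched up by the new thread's TID.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical names throughout; this is not ProcControlAPI's actual code.
enum class CloneSide { Parent, Child };

class ClonePairTracker {
public:
    // Record one half of a clone() notification for the new thread
    // `child_tid`. Returns true once both halves have been seen (the held
    // thread can be released), false while a half is still missing.
    bool observe(int child_tid, CloneSide side) {
        assert(child_tid > 0);
        Seen &s = pending_[child_tid];  // creates an empty record on first sight
        if (side == CloneSide::Parent) {
            s.parent = true;
        } else {
            s.child = true;
        }
        if (s.parent && s.child) {
            pending_.erase(child_tid);  // pair complete, stop tracking it
            return true;
        }
        return false;
    }

    // True while exactly one half has been seen for `child_tid`.
    bool is_waiting(int child_tid) const {
        return pending_.count(child_tid) != 0;
    }

private:
    struct Seen {
        bool parent = false;
        bool child = false;
    };
    std::unordered_map<int, Seen> pending_;
};
```

If the child half never reaches observe(), is_waiting() stays true forever and whatever is held on that pair never resumes, which would look like the hang described here.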

In the trace I see a sequence that looks like:

[linux.C:167-G] - Stopped with signal 19
[generator.C:209-G] - Got event
[generator.C:144-G] - Setting generator state to decoding
[generator.C:144-G] - Setting generator state to statesync
[generator.C:144-G] - Setting generator state to queueing
[generator.C:144-G] - Setting generator state to none
[generator.C:144-G] - Setting generator state to process_blocked

I've never seen this before, and I'm not sure what happened. It almost looks like ProcControlAPI got an event that it couldn't understand. I wonder if this is the missing event from the new thread. I'd suggest focusing on this and seeing if you can trace what happened.

A few minutes after I wrote this I realized what the core problem is. ProcControlAPI keeps track of "dead threads" in the ProcPool, and uses this list to suppress events that trickle in from dead multi-threaded processes (we'd sometimes see Linux feed us queued up debug events from threads after a process's main thread dies). As Josh suggested, we're likely seeing TID reuse and misidentifying the new thread as a lingering event from a dead thread.
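
A simplified sketch of how that goes wrong, and one way it might be avoided. DeadThreadFilter and its methods are hypothetical names, not the real ProcPool interface, and clearing the entry when a clone announces the TID is just one possible fix, not necessarily what we'll end up doing.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical sketch; these names do not come from the ProcControlAPI source.
class DeadThreadFilter {
public:
    // Remember a TID whose thread has exited, so late events can be dropped.
    void mark_dead(int tid) {
        assert(tid > 0);
        dead_.insert(tid);
    }

    // Assumed hook: called when a parent's clone event announces `tid` as a
    // new thread. Erasing the stale entry keeps a recycled TID from being
    // mistaken for a straggler.
    void mark_created(int tid) { dead_.erase(tid); }

    // Events from TIDs still on the dead list are treated as stragglers.
    bool should_suppress(int tid) const { return dead_.count(tid) != 0; }

private:
    std::unordered_set<int> dead_;
};
```

Without the mark_created() step, a TID that was marked dead and then handed out again by the kernel still makes should_suppress() return true, so the new thread's first event is thrown away.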

-Matt