Re: [DynInst_API:] How to use DyninstAPI to get thread/process call stack in case of a signal delivery


Date: Tue, 1 Jul 2014 23:12:18 +0800
From: JiangJie <yangtzj@xxxxxxxxxxx>
Subject: Re: [DynInst_API:] How to use DyninstAPI to get thread/process call stack in case of a signal delivery
Greg,

When I tried the "sess-06" example on a Linux cluster that uses the SLURM
resource manager, I got repeated debug messages such as
"failing to getAdressRanges for mpi_ringtopo2.cpp".

Would you please take a look at the output below and tell me if there is anything wrong?


$ stat-cl -X $PWD/libsess-06.so -L $HOME/logs -l FE -l BE -l CP -M -C srun -n 8 ../bin/mpi_ringtopo2 20
STAT started at 2014-06-16-09:03:12
Launching application and tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Resuming the application...
Resumed!

## Prototype DysectAPI enabled ##
Notice: Traditional sampling is disabled troughout session!
Setting up frontend session '/nfs_shared/STAT/STAT-2.1/share/STAT/examples/sessions/libsess-06.so'...
<Jun 16 09:03:13> DysectAPI Frontend: Verbose > Break on enter key: yes
<Jun 16 09:03:13> DysectAPI Frontend: Verbose > Break on timeout: no
<Jun 16 09:03:13> DysectAPI Frontend: Info > _expression_: '1,4,5' has been resolved to daemon ranks: 1,
<Jun 16 09:03:13> DysectAPI Frontend: Info > _expression_: '4,5' has been resolved to daemon ranks: 1,
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]:  failing to getAdressRanges for mpi_ringtopo2.cpp[60]
<Jun 16 09:03:13> DysectAPI Frontend: Info > DysectAPI setup took 405 ms
Dysect session setup complete
Application already running... ignoring request to resume
Waiting for events (! denotes captured event)
Hit <enter> to stop session

Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
cn113, MPI task 1 of 8 stalling for 20 of 20 seconds
<Jun 16 09:03:19> DysectAPI Frontend: Info > [8] Stack trace:
<Jun 16 09:03:19> DysectAPI Frontend: Info >  |-> [8] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:19> DysectAPI Frontend: Info > [8] Trace: Location is 'mpi_ringtopo2.cpp:73'

Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Stack trace:
<Jun 16 09:03:21> DysectAPI Frontend: Info >  |-> [2] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: rank = [4:5] : Location is do_SendOrStall(int, int, int, int*, int*, int):mpi_ringtopo2.cpp:94
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Function is 'do_SendOrStall(int, int, int, int*, int*, int)'
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Location is 'mpi_ringtopo2.cpp:94'
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Rank is 'rank = [4:5] '


<Jun 16 09:03:22> DysectAPI Frontend: Info > [2] Trace: Location is ?:0'
cn113, MPI task 1 of 8 stalling for 10 of 20 seconds
cn113, MPI task 1 of 8 proceeding

Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Stack trace:
<Jun 16 09:03:41> DysectAPI Frontend: Info >  |-> [1] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: rank = [1:1] : Location is do_SendOrStall(int, int, int, int*, int*, int):mpi_ringtopo2.cpp:94
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Function is 'do_SendOrStall(int, int, int, int*, int*, int)'
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Location is 'mpi_ringtopo2.cpp:94'
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Rank is 'rank = [1:1] '

Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
mpi_ringtopo Done
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn113]: *** STEP 48.1 CANCELLED AT 2014-06-16T09:03:51 ***
srun: error: cn113: task 0: Aborted (core dumped)
<Jun 16 09:03:52> DysectAPI Frontend: Info > Stopping session - application has exited
Detaching from application...

Results written to /nfs_shared/STAT/STAT-2.1/share/STAT/examples/sessions/stat_results/mpi_ringtopo2.0002




Regards,
Jie


From: lee218@xxxxxxxx
To: yangtzj@xxxxxxxxxxx; legendre1@xxxxxxxx
CC: dyninst-api@xxxxxxxxxxx
Subject: RE: [DynInst_API:] How to use DyninstAPI to get thread/process call stack in case of a signal delivery
Date: Tue, 22 Apr 2014 15:29:08 +0000

Jie,

 

The DysectAPI is available in STAT 2.1, but it is currently just a prototype, so it is not documented and is most likely very buggy!  When you configure STAT, supply the --enable-dysectapi flag.  After building STAT, you will find example “sessions” in the “examples/sessions” directory.  From there, you can run:

 

% <your_stat_prefix>/bin/dysectc sess-06.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libsess-06.so -L $HOME/logs -l FE -l BE -l CP -M -C srun -n 8 ../src/mpi_ringtopo2 20

 

The sess-06.cpp file (which is compiled and run by the commands above) should give you an idea of some of the general DysectAPI features.  The session below shows how to gather stack traces on a signal.  While the application is running, you can send SIGUSR1 to one of the MPI processes with `kill -10 <PID>` (signal 10 is SIGUSR1 on x86 Linux; `kill -USR1 <PID>` is the portable spelling) to trigger the stack trace sampling.
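
As an aside, the trigger signal can also be sent from a small program instead of the shell. The following is only a minimal sketch using the POSIX kill() and sigaction() calls; it is not part of STAT or the DysectAPI, and the helper names (install_usr1_handler, send_usr1) are made up for illustration. It uses the symbolic name SIGUSR1 rather than the number 10, since the number is architecture dependent. The handler is there only so the demo can signal itself safely: the default action for SIGUSR1 terminates a process that has not installed one.

```cpp
#include <csignal>
#include <sys/types.h>
#include <unistd.h>

// Set by the handler; volatile sig_atomic_t is safe to write from a handler.
static volatile sig_atomic_t g_got_usr1 = 0;

static void on_usr1(int) { g_got_usr1 = 1; }

// Hypothetical helper: install a SIGUSR1 handler in the calling process.
// Without a handler, the default action for SIGUSR1 is to terminate.
bool install_usr1_handler() {
    struct sigaction sa = {};
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, nullptr) == 0;
}

// Hypothetical helper: send SIGUSR1 to `pid` by name, not by number.
// Returns true if the signal was sent.
bool send_usr1(pid_t pid) {
    return kill(pid, SIGUSR1) == 0;
}
```

Calling install_usr1_handler() and then send_usr1(getpid()) in a single-threaded program sets g_got_usr1 before kill() returns; to signal one of the MPI processes, pass its PID instead.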

 

% cat onsigusr.cpp

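#include <LibDysectAPI.h>

DysectStatus DysectAPI::onProcStart() {
  // Trigger on SIGUSR1, apply to all processes (the world) with a
  // 500 ms wait, and gather a STAT stack trace as the action.
  Probe *p = new Probe(Async::signal(SIGUSR1),
                       Domain::world(500),
                       Act::stat());
  ProbeTree::addRoot(p);
  return DysectOK;
}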

 

% <your_stat_prefix>/bin/dysectc onsigusr.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libonsigusr.so -L $HOME/logs -l FE -l BE  -C srun -n 8 ../src/mpi_ringtopo2 30

STAT started at 2014-04-22-08:18:33
Launching application and tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Resuming the application...
Resumed!

## Prototype DysectAPI enabled ##
Notice: Traditional sampling is disabled troughout session!
Setting up frontend session '/g/g0/lee218/src/STAT/examples/sessions/libonsigusr.so'...
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on enter key: yes
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on timeout: no
<Apr 22 08:18:33> DysectAPI Frontend: Info > DysectAPI setup took 4 ms
Dysect session setup complete
Application already running... ignoring request to resume
Waiting for events (! denotes captured event)
Hit <enter> to stop session
rzmerl14, MPI task 1 of 8 stalling for 30 of 30 seconds
rzmerl14, MPI task 1 of 8 stalling for 20 of 30 seconds

Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
srun: error: rzmerl14: task 4: User defined signal 1
rzmerl14, MPI task 1 of 8 stalling for 10 of 30 seconds
rzmerl14, MPI task 1 of 8 proceeding
srun: First task exited 30s ago
srun: tasks 0-3,5-7: running
srun: task 4: exited abnormally
srun: Terminating job step 1947661.2
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
<Apr 22 08:19:25> DysectAPI Frontend: Info > Stopping session - application has exited
Detaching from application...
Detached!

Results written to /g/g0/lee218/src/STAT/examples/sessions/stat_results/mpi_ringtopo2.0439

 

% <your_stat_prefix>/bin/stat-view stat_results/mpi_ringtopo2.0439/*.dot

 

 

In this example, the “Domain::world(500)” argument to the probe means that the probe is applied to all processes (the world).  Only one process needs to receive SIGUSR1 to trigger the probe; it then waits 500 ms in case other processes receive the signal too, and once the 500 ms have elapsed it gathers the STAT stack trace.
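
The 500 ms wait described above can be modelled roughly as follows. This is only an illustrative sketch of the batching behaviour, not DysectAPI code: SignalEvent and batch_by_window are hypothetical names, and the real implementation may well differ. The first event opens a window, every event that arrives before the window closes joins the same batch, and a later event opens a new window.

```cpp
#include <vector>

// Hypothetical record: which rank saw the signal, and when (in ms).
struct SignalEvent {
    long time_ms;
    int rank;
};

// Groups events into batches. Assumes `events` is sorted by time_ms.
// The first event of a batch opens a window of `window_ms`; events that
// arrive strictly before the window closes are added to that batch.
std::vector<std::vector<int>> batch_by_window(
        const std::vector<SignalEvent>& events, long window_ms) {
    std::vector<std::vector<int>> batches;
    long window_end = 0;
    for (const SignalEvent& ev : events) {
        if (batches.empty() || ev.time_ms >= window_end) {
            batches.emplace_back();              // open a new window
            window_end = ev.time_ms + window_ms;
        }
        batches.back().push_back(ev.rank);
    }
    return batches;
}
```

With a 500 ms window, events from ranks 4, 1 and 5 at 0, 120 and 499 ms fall into one batch (one trace gathering), while an event at 500 ms or later starts a new one.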

 

Let me know how this works for you or if you have any questions.

 

                -Greg

 


