Greg,
When I tried the "sess-06" example on a Linux cluster that uses SLURM as the resource manager,
I got repeated debug messages such as
"failing to getAdressRanges for mpi_ringtopo2.cpp".
Could you please look at the output below and tell me whether anything is wrong?
$ stat-cl -X $PWD/libsess-06.so -L $HOME/logs -l FE -l BE -l CP -M -C srun -n 8 ../bin/mpi_ringtopo2 20
STAT started at 2014-06-16-09:03:12
Launching application and tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Resuming the application...
Resumed!
## Prototype DysectAPI enabled ##
Notice: Traditional sampling is disabled troughout session!
Setting up frontend session '/nfs_shared/STAT/STAT-2.1/share/STAT/examples/sessions/libsess-06.so'...
<Jun 16 09:03:13> DysectAPI Frontend: Verbose > Break on enter key: yes
<Jun 16 09:03:13> DysectAPI Frontend: Verbose > Break on timeout: no
<Jun 16 09:03:13> DysectAPI Frontend: Info > expression: '1,4,5' has been resolved to daemon ranks: 1,
<Jun 16 09:03:13> DysectAPI Frontend: Info > expression: '4,5' has been resolved to daemon ranks: 1,
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[94]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[73]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
Symtab.C[2345]: failing to getAdressRanges for mpi_ringtopo2.cpp[60]
<Jun 16 09:03:13> DysectAPI Frontend: Info > DysectAPI setup took 405 ms
Dysect session setup complete
Application already running... ignoring request to resume
Waiting for events (! denotes captured event)
Hit <enter> to stop session
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
cn113, MPI task 1 of 8 stalling for 20 of 20 seconds
<Jun 16 09:03:19> DysectAPI Frontend: Info > [8] Stack trace:
<Jun 16 09:03:19> DysectAPI Frontend: Info > |-> [8] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:19> DysectAPI Frontend: Info > [8] Trace: Location is 'mpi_ringtopo2.cpp:73'
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Stack trace:
<Jun 16 09:03:21> DysectAPI Frontend: Info > |-> [2] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: rank = [4:5] : Location is do_SendOrStall(int, int, int, int*, int*, int):mpi_ringtopo2.cpp:94
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Function is 'do_SendOrStall(int, int, int, int*, int*, int)'
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Location is 'mpi_ringtopo2.cpp:94'
<Jun 16 09:03:21> DysectAPI Frontend: Info > [2] Trace: Rank is 'rank = [4:5] '
<Jun 16 09:03:22> DysectAPI Frontend: Info > [2] Trace: Location is ?:0'
cn113, MPI task 1 of 8 stalling for 10 of 20 seconds
cn113, MPI task 1 of 8 proceeding
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Stack trace:
<Jun 16 09:03:41> DysectAPI Frontend: Info > |-> [1] _start > __libc_start_main > main > do_SendOrStall(int, int, int, int*, int*, int)
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: rank = [1:1] : Location is do_SendOrStall(int, int, int, int*, int*, int):mpi_ringtopo2.cpp:94
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Function is 'do_SendOrStall(int, int, int, int*, int*, int)'
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Location is 'mpi_ringtopo2.cpp:94'
<Jun 16 09:03:41> DysectAPI Frontend: Info > [1] Trace: Rank is 'rank = [1:1] '
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
mpi_ringtopo Done
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn113]: *** STEP 48.1 CANCELLED AT 2014-06-16T09:03:51 ***
srun: error: cn113: task 0: Aborted (core dumped)
<Jun 16 09:03:52> DysectAPI Frontend: Info > Stopping session - application has exited
Detaching from application...
Results written to /nfs_shared/STAT/STAT-2.1/share/STAT/examples/sessions/stat_results/mpi_ringtopo2.0002
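In case it helps with triage, the repeated Symtab warnings above refer to only three source lines (60, 73 and 94), so they can be tallied per line. Below is a minimal C++ sketch, assuming the stat-cl output has been captured into a string; the function name countSymtabWarnings is my own and is not part of STAT or Dyninst.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not part of STAT or Dyninst): counts the
// "failing to getAdressRanges" warnings per source line number.
// Input: captured tool output. Output: source line -> number of warnings.
std::map<int, int> countSymtabWarnings(const std::string& log) {
    // Matches e.g. "... failing to getAdressRanges for mpi_ringtopo2.cpp[94]"
    static const std::regex pattern(
        R"(failing to getAdressRanges for [^\[\]]+\[(\d+)\])");

    std::map<int, int> counts;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_search(line, m, pattern)) {
            ++counts[std::stoi(m[1].str())];  // m[1] is the source line number
        }
    }
    return counts;
}
```

Iterating over the returned map and printing each key with its count gives one summary line per source line instead of dozens of identical warnings.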
Regards,
Jie
From: lee218@xxxxxxxx
To: yangtzj@xxxxxxxxxxx; legendre1@xxxxxxxx
CC: dyninst-api@xxxxxxxxxxx
Subject: RE: [DynInst_API:] How to use DyninstAPI to get thread/process call stack in case of a signal delivery
Date: Tue, 22 Apr 2014 15:29:08 +0000
Jie,
The DysectAPI is available in STAT 2.1, but only as a prototype, so it is undocumented and
most likely very buggy! When you configure STAT, supply the --enable-dysectapi flag. After building STAT, you will find example “sessions” in the “examples/sessions” directory. From there, you can run:
% <your_stat_prefix>/bin/dysectc sess-06.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libsess-06.so -L $HOME/logs -l FE -l BE -l CP -M -C srun -n 8 ../src/mpi_ringtopo2 20
The sess-06.cpp file (compiled and loaded by the commands above) should give you an idea of some of the general DysectAPI features. The session below shows how to
gather stack traces on a signal. While the application is running, you can send SIGUSR1 to one of the MPI processes (`kill -10 <PID>`; SIGUSR1 is signal 10 on x86 Linux) to trigger the stack trace sampling.
% cat onsigusr.cpp
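#include <signal.h>        // SIGUSR1
#include <LibDysectAPI.h>

DysectStatus DysectAPI::onProcStart() {
  // Register a single root probe: on SIGUSR1 in any process,
  // wait 500 ms, then gather a STAT stack trace.
  Probe *p = new Probe(Async::signal(SIGUSR1),
                       Domain::world(500),
                       Act::stat());
  ProbeTree::addRoot(p);
  return DysectOK;
}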
% <your_stat_prefix>/bin/dysectc onsigusr.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libonsigusr.so -L $HOME/logs -l FE -l BE -C srun -n 8 ../src/mpi_ringtopo2 30
STAT started at 2014-04-22-08:18:33
Launching application and tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Resuming the application...
Resumed!
## Prototype DysectAPI enabled ##
Notice: Traditional sampling is disabled troughout session!
Setting up frontend session '/g/g0/lee218/src/STAT/examples/sessions/libonsigusr.so'...
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on enter key: yes
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on timeout: no
<Apr 22 08:18:33> DysectAPI Frontend: Info > DysectAPI setup took 4 ms
Dysect session setup complete
Application already running... ignoring request to resume
Waiting for events (! denotes captured event)
Hit <enter> to stop session
rzmerl14, MPI task 1 of 8 stalling for 30 of 30 seconds
rzmerl14, MPI task 1 of 8 stalling for 20 of 30 seconds
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
srun: error: rzmerl14: task 4: User defined signal 1
rzmerl14, MPI task 1 of 8 stalling for 10 of 30 seconds
rzmerl14, MPI task 1 of 8 proceeding
srun: First task exited 30s ago
srun: tasks 0-3,5-7: running
srun: task 4: exited abnormally
srun: Terminating job step 1947661.2
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
<Apr 22 08:19:25> DysectAPI Frontend: Info > Stopping session - application has exited
Detaching from application...
Detached!
Results written to /g/g0/lee218/src/STAT/examples/sessions/stat_results/mpi_ringtopo2.0439
% <your_stat_prefix>/bin/stat-view stat_results/mpi_ringtopo2.0439/*.dot
In this example, the “Domain::world(500)” argument means that the probe applies to all processes (the world). Only one process needs to receive
the SIGUSR1 signal; the probe then waits 500 ms in case other processes receive the signal too, and after that window it gathers the STAT stack trace.
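To make the timing idea concrete, here is a rough C++ sketch of such a wait window: the first event opens a window, and every event arriving within the next 500 ms joins the same batch. This only illustrates the concept and is not how DysectAPI is implemented internally; SignalEvent and batchByWindow are made-up names for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical event record (not a DysectAPI type).
struct SignalEvent {
    int rank;             // MPI rank that received the signal
    std::int64_t timeMs;  // arrival time in milliseconds
};

// Groups events into batches. A batch opens at its first event and absorbs
// every later event that arrives within windowMs of that first event.
// Returns the ranks in each batch, in arrival order.
std::vector<std::vector<int>> batchByWindow(std::vector<SignalEvent> events,
                                            std::int64_t windowMs) {
    std::stable_sort(events.begin(), events.end(),
                     [](const SignalEvent& a, const SignalEvent& b) {
                         return a.timeMs < b.timeMs;
                     });

    std::vector<std::vector<int>> batches;
    std::int64_t windowStart = 0;
    for (const SignalEvent& e : events) {
        if (batches.empty() || e.timeMs - windowStart > windowMs) {
            batches.emplace_back();   // window expired: open a new batch
            windowStart = e.timeMs;
        }
        batches.back().push_back(e.rank);
    }
    return batches;
}
```

With a 500 ms window, signals arriving at 0 ms (rank 4) and 300 ms (rank 2) end up in one batch, while a signal at 900 ms (rank 7) starts a second batch, so a single stack trace sample can cover all processes that were signalled close together.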
Let me know how this works for you or if you have any questions.
-Greg