Jie,
DysectAPI is available in STAT 2.1, but only as a prototype: it is not documented and is most likely very buggy! To enable it, supply the
--enable-dysectapi flag when you configure STAT. After building STAT, you will find example “sessions” in the “examples/sessions” directory. From there, you can run:
% <your_stat_prefix>/bin/dysectc sess-06.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libsess-06.so -L $HOME/logs -l FE -l BE -l CP -M -C srun -n 8 ../src/mpi_ringtopo2 20
The sess-06.cpp file (compiled and loaded by the commands above) should give you an idea of some of the general DysectAPI features. The session below shows how to
gather stack traces on a signal. While the application is running, you can run `kill -10 <PID>` against one of the MPI processes to trigger the stack trace sampling (signal 10 is SIGUSR1 on most Linux systems; `kill -USR1 <PID>` is equivalent).
% cat onsigusr.cpp
#include <LibDysectAPI.h>

DysectStatus DysectAPI::onProcStart() {
    Probe *p = new Probe(Async::signal(SIGUSR1),
                         Domain::world(500),
                         Act::stat());
    ProbeTree::addRoot(p);
    return DysectOK;
}
% <your_stat_prefix>/bin/dysectc onsigusr.cpp
% export STAT_GROUP_OPS=1
% <your_stat_prefix>/bin/stat-cl -X $PWD/libonsigusr.so -L $HOME/logs -l FE -l BE -C srun -n 8 ../src/mpi_ringtopo2 30
STAT started at 2014-04-22-08:18:33
Launching application and tool daemons...
Tool daemons launched and connected!
Attaching to application...
Attached!
Resuming the application...
Resumed!
## Prototype DysectAPI enabled ##
Notice: Traditional sampling is disabled troughout session!
Setting up frontend session '/g/g0/lee218/src/STAT/examples/sessions/libonsigusr.so'...
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on enter key: yes
<Apr 22 08:18:33> DysectAPI Frontend: Verbose > Break on timeout: no
<Apr 22 08:18:33> DysectAPI Frontend: Info > DysectAPI setup took 4 ms
Dysect session setup complete
Application already running... ignoring request to resume
Waiting for events (! denotes captured event)
Hit <enter> to stop session
rzmerl14, MPI task 1 of 8 stalling for 30 of 30 seconds
rzmerl14, MPI task 1 of 8 stalling for 20 of 30 seconds
Sampling traces...
Traces sampled!
Merging traces...
Traces merged!
srun: error: rzmerl14: task 4: User defined signal 1
rzmerl14, MPI task 1 of 8 stalling for 10 of 30 seconds
rzmerl14, MPI task 1 of 8 proceeding
srun: First task exited 30s ago
srun: tasks 0-3,5-7: running
srun: task 4: exited abnormally
srun: Terminating job step 1947661.2
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[rzmerl14]: *** STEP 1947661.2 KILLED AT 2014-04-22T08:19:24 WITH SIGNAL 9 ***
<Apr 22 08:19:25> DysectAPI Frontend: Info > Stopping session - application has exited
Detaching from application...
Detached!
Results written to /g/g0/lee218/src/STAT/examples/sessions/stat_results/mpi_ringtopo2.0439
% <your_stat_prefix>/bin/stat-view stat_results/mpi_ringtopo2.0439/*.dot
In this example, the “Domain::world(500)” argument means that the probe applies to all processes (the world). Only one process needs to receive
SIGUSR1 to trigger the probe. The probe then waits 500 ms in case other processes receive the signal too, and after those 500 ms it gathers the STAT stack traces.
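To make the 500 ms wait concrete, here is a small standalone C++ sketch of the grouping behavior described above. It is only an illustration of the idea, not DysectAPI code: `SignalEvent` and `ranksInFirstWindow` are hypothetical names, and the sketch assumes the window opens at the first signal and that a signal arriving exactly at the deadline still counts.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical record of one process receiving the signal (not a DysectAPI type).
struct SignalEvent {
    int rank;      // MPI rank that received SIGUSR1
    long time_ms;  // arrival time in milliseconds
};

// Returns the ranks whose signals arrive within window_ms of the earliest signal.
// Later signals fall outside the window and are not part of this group.
std::set<int> ranksInFirstWindow(std::vector<SignalEvent> events, long window_ms) {
    assert(window_ms >= 0);
    std::set<int> ranks;
    if (events.empty()) {
        return ranks;
    }
    // Order events by arrival time so the first element opens the window.
    std::sort(events.begin(), events.end(),
              [](const SignalEvent &a, const SignalEvent &b) {
                  return a.time_ms < b.time_ms;
              });
    const long deadline = events.front().time_ms + window_ms;
    for (const SignalEvent &e : events) {
        if (e.time_ms > deadline) {
            break;  // arrived after the window closed
        }
        ranks.insert(e.rank);
    }
    return ranks;
}
```

For example, with signals hitting rank 4 at 1000 ms, rank 2 at 1300 ms, and rank 6 at 1700 ms, a 500 ms window groups ranks 4 and 2 into one event, and rank 6 arrives too late to join it.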
Let me know how this works for you or if you have any questions.
-Greg