[DynInst_API:] A new CUDA profiling library for accurate sync and execution time


Date: Sat, 18 Apr 2020 10:01:46 -0500
From: Barton Miller <bart@xxxxxxxxxxx>
Subject: [DynInst_API:] A new CUDA profiling library for accurate sync and execution time
Hi all. If you are working on CPU/GPU performance on CUDA, this might be
of interest.

Stay well.

--bart

--------------------------------------------------------------------------
SUMMARY:

I wanted to share with you a library that can help you accurately profile
CUDA applications, giving a complete picture of the time spent in each
CUDA library function and, more importantly and interestingly, the amount
of synchronization waiting time spent in each function.

Most of you are probably familiar with the current limitations of profiling
tools for Nvidia GPUs. These tools rely on Nvidia's CUPTI performance data
collection framework, which provides almost no information on CPU/GPU
synchronization in the CUDA library (libcuda). Among approximately 450 CUDA
API functions, CUPTI generates synchronization timing information for only
two: cuStreamSynchronize and cuCtxSynchronize. Because of these limitations,
existing tools report incomplete synchronization times to the user.

DETAILS:

Our group has developed a tool that overcomes these limitations. The tool
produces an instrumented version of your CUDA library, which in turn produces
a profile of your application. You can use this library by itself or, if you
are a tool developer, use it to enhance your data collection. The
instrumentation is done directly on the library binary code.

The instrumented library profiles a CUDA application and produces a list of
the CUDA API functions called by the application along with their execution
times and time spent by each function in synchronization. The library also
supports a callback mechanism to enable tracing at the granularity of a
single CUDA function call on a per-thread basis.

The library reports data in a way similar to CUPTI. The output is in CSV
format, which can either be consumed by another application for further
analysis or be viewed in a human-friendly way using a script that we
provide.
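
As a rough illustration of consuming the CSV output from another program,
the sketch below ranks functions by total synchronization time using awk.
The column layout (function name, execution time, synchronization time),
the header row, and the sample values are all assumptions made up for this
example; consult the user manual for the actual format.

```shell
# Hypothetical sample data; the real column layout may differ.
csv_sample='function,exec_time_ns,sync_time_ns
cuMemcpyDtoH,5000,4200
cuLaunchKernel,800,0
cuMemFree,3000,2500
cuMemcpyDtoH,2000,1800'

# Sum column 3 (assumed to be sync time) per function, skipping the header
# row, then list the functions with the most waiting time first.
sync_totals=$(printf '%s\n' "$csv_sample" |
  awk -F, 'NR > 1 { total[$1] += $3 } END { for (f in total) print total[f], f }' |
  sort -rn)

printf '%s\n' "$sync_totals"
```

With real output, the inline sample would be replaced by the CSV file the
instrumented library writes, and the column numbers adjusted to match it.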

This is a new pre-release evaluation version. Please contact us if you have
any questions, either at my email address or at dyninst-api@xxxxxxxxxxxx.
We'd love to have your feedback and suggestions.

INSTALLATION AND BUILD:

The tool depends on the Dyninst binary instrumentation framework (which can
be installed by following the instructions at
https://github.com/dyninst/dyninst/wiki/Building-Dyninst),
the Boost C++ libraries version 1.61, and (of course) a supported Nvidia GPU
(4xx series GPU driver versions have been tested). Once Dyninst is installed,
the tool can be built and installed using cmake 3.1 or later as follows:

$ export LD_LIBRARY_PATH=<DYNINST_INSTALL_PREFIX>/lib/:<BOOST_INSTALL_PREFIX>/install/lib/:/usr/lib/x86_64-linux-gnu/
$ export DYNINSTAPI_RT_LIB=<DYNINST_INSTALL_PREFIX>/lib/libdyninstAPI_RT.so
$ git clone https://github.com/dyninst/tools.git
$ cd tools/cuda_sync_analyzer
$ mkdir build && cd build
$ cmake ..  \
  -DDYNINST_ROOT=<DYNINST_INSTALL_PREFIX> \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
  -DBOOST_LIBRARYDIR=<BOOST_INSTALL_PREFIX>/install/lib \
  -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo \
  -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
  -DCMAKE_INSTALL_PREFIX=<INSTALL_PREFIX>
$ make && make install
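
Before running cmake, it can help to confirm that DYNINSTAPI_RT_LIB points
at a real file, since a placeholder left unexpanded is easy to miss. The
following is a minimal sketch; the helper name check_rt_lib is hypothetical
and is not part of the tool or of Dyninst.

```shell
# check_rt_lib: hypothetical helper, not part of cuda_sync_analyzer.
# Succeeds only if the given path names an existing regular file.
check_rt_lib() {
  if [ -z "$1" ]; then
    echo "DYNINSTAPI_RT_LIB is not set" >&2
    return 1
  fi
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 1
  fi
  echo "found Dyninst runtime library: $1"
}

# Usage: check_rt_lib "$DYNINSTAPI_RT_LIB" || exit 1
```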

RELEVANT LINKS:

Github repository - https://github.com/dyninst/tools/tree/master/cuda_sync_analyzer

User manual - https://docs.google.com/document/d/1h12Uq-cQyNSRuajZQo9bhcpFFPZVmL1g-ztCRifze5s

PAPERS THAT DESCRIBE THIS WORK:

Benjamin Welton and Barton P. Miller, "Exposing Hidden Performance
Opportunities in High Performance GPU Applications", 18th IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing (CCGrid),
Washington, DC, May 2018. Best paper award.
ftp://ftp.cs.wisc.edu/paradyn/papers/welton-unobvious.pdf

Benjamin Welton and Barton P. Miller, "Diogenes: Looking For An Honest
CPU/GPU Performance Measurement Tool", Supercomputing 2019 (SC2019),
Denver, November 2019. 
ftp://ftp.cs.wisc.edu/paradyn/technical_papers/diogenes-sc2019.pdf

Benjamin Welton and Barton P. Miller, "Identifying and (Automatically)
Remedying Performance Problems in CPU/GPU Applications", International
Conference on Supercomputing (ICS), Barcelona, Spain, June 2020.
ftp://ftp.cs.wisc.edu/paradyn/technical_papers/welton_autocorrect.pdf