The Paradyn project has two new technical reports, one in the area of binary
code analysis and the other in the area of MRNet scalability.
Our full list of project publications can always be found at:
http://www.paradyn.org/html/publications-by-year.html
Comments and feedback on these papers are always welcome!
------------------------------------------------------------------------------
"Binary Code Is Not Easy", Xiaozhu Meng and Barton P. Miller
Submitted for publication, March 2015.
ftp://ftp.cs.wisc.edu/paradyn/papers/Meng15Parsing.pdf
Abstract:
Binary code analysis is an enabling technique for many applications. Binary
code analysis tool kits provide several capabilities to automatically
analyze binary code, including decoding machine instructions, building
control flow graphs of the program, and determining function boundaries.
Modern compilers and run-time libraries have introduced significant
complexities to binary code. These complexities negatively affect the
capabilities of binary analysis tool kits, which may cause tools to report
inaccurate information about binary code. This inaccuracy can confuse
analysts and degrade the quality of applications built on these tool
kits. We examine the problem of constructing control flow graphs from
binary code and labeling the graphs with accurate function boundary
annotations. We identify eight challenging code constructs that represent
hard-to-analyze aspects of the code from modern compilers, and show code
examples that illustrate each code construct. As part of this discussion,
we discuss how the new code parsing algorithms in the open source Dyninst
tool kit support these constructs. In particular, we present a new model
for describing jump tables that improves our ability to precisely determine
the control flow targets, a new interprocedural analysis to determine when a
function is non-returning, and techniques for handling tail calls, code
overlapping between functions, and code overlapping within instructions.
We created test binaries patterned after each challenging code construct
and evaluated how tool kits, including Dyninst, fare when parsing these
binaries.
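As a rough illustration of the jump table problem described above (this is a
sketch, not Dyninst's actual algorithm; the function name, table layout, and
addresses are illustrative assumptions): a common position-independent x86-64
jump table stores 32-bit signed offsets relative to the table's base address,
so enumerating the indirect jump's targets requires recovering both that base
and the index bound implied by the compiler's preceding bounds check.

```python
import struct

def jump_table_targets(table_bytes, table_base, index_bound):
    """Enumerate indirect-jump targets from a position-independent
    jump table of 32-bit signed offsets, relative to the table base.
    index_bound is the exclusive upper bound on the index, as recovered
    from the bounds check (e.g. a cmp/ja pair) guarding the jump."""
    targets = []
    for i in range(index_bound):
        # Each entry is a little-endian signed 32-bit offset.
        (off,) = struct.unpack_from("<i", table_bytes, i * 4)
        targets.append(table_base + off)
    # Several switch cases may share one target block.
    return sorted(set(targets))

# A 3-entry table at (hypothetical) base 0x400800; two entries share a target.
table = struct.pack("<3i", -0x100, -0xF0, -0x100)
print([hex(t) for t in jump_table_targets(table, 0x400800, 3)])
```

The interesting analysis work, which this sketch omits, is determining
`table_base` and `index_bound` from the instruction stream in the first place.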
------------------------------------------------------------------------------
"Mr. Scan: A Hybrid/Hybrid Extreme Scale Density Based Clustering Algorithm",
Benjamin Welton and Barton Miller, submitted for publication, March 2015.
ftp://ftp.cs.wisc.edu/paradyn/papers/Welton15MrScan.pdf
Abstract:
Density-based clustering algorithms are a widely-used class of data mining
techniques that can find irregularly shaped clusters and cluster data
without prior knowledge of the number of clusters the data contains. DBSCAN
is the most well-known density-based clustering algorithm. We introduce our
extension of DBSCAN, called Mr. Scan, which uses a hybrid/hybrid parallel
implementation that combines the MRNet tree-based distribution network with
GPU-equipped nodes. Mr. Scan avoids the problems encountered in other
parallel versions of DBSCAN, such as scalability limits, reduction in output
quality at large scales, and the inability to effectively process dense
regions of data. Mr. Scan uses effective data partitioning and a new merging
technique to allow data sets to be broken into independently processable
partitions without the reduction in quality or large amount of node-to-node
communication seen in other parallel versions of DBSCAN. The dense box
algorithm designed as part of Mr. Scan allows for dense regions to be
detected and clustered without the need to individually compare all points
in these regions to one another. Mr. Scan was tested on both a geolocated
Twitter dataset and image data obtained from the Sloan Digital Sky Survey.
In testing Mr. Scan we performed end-to-end benchmarks measuring complete
application run time from reading raw unordered input point data from the
file system to writing the final clustered output to the file system. The
use of end-to-end benchmarking gives a clear picture of the performance that
can be expected from Mr. Scan in real world use cases. At its largest scale,
Mr. Scan clustered 6.5 billion points from the Twitter dataset on 8,192 GPU
nodes on Cray Titan in 7.5 minutes.
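For readers unfamiliar with DBSCAN, the sequential algorithm that Mr. Scan
extends can be sketched in a few lines. This is the textbook baseline, not
Mr. Scan's distributed GPU implementation; the parameter names eps and
min_pts follow common convention.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (2-D)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (qx - px) ** 2 + (qy - py) ** 2 <= eps * eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may be relabeled as a border point
            continue
        labels[i] = cid                # i is a core point: start a cluster
        seeds = list(neighbors)
        while seeds:                   # expand the cluster outward
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cid        # border point joins the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cid
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:   # j is also core: keep expanding
                seeds.extend(nbrs)
        cid += 1
    return labels
```

The naive neighborhood query above compares every pair of points; the dense
box technique in the abstract exists precisely to avoid that all-pairs cost
in dense regions.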
------------------------------------------------------------------------------