The Paradyn project has a new technical report in the area of forensics of
binary machine code. This is cool new work on detecting authorship in binary
code.
Our full list of project publications can always be found at:
http://www.paradyn.org/html/publications-by-year.html
Comments and feedback on our papers are always welcome!
------------------------------------------------------------------------------
"Identifying Multiple Authors in a Binary Program",
Xiaozhu Meng, Barton P. Miller, and Kwang-Sung Jun
Submitted for publication, November 2016.
ftp://ftp.cs.wisc.edu/paradyn/papers/Meng16MultiAuthor.pdf
Abstract:
Binary code authorship identification is the task of determining program
authors from binary code; it has significant application to forensics of
malicious software (malware), software supply chain risk management, and
software plagiarism detection. Modern software, including malware, often
contains code from multiple authors. Recent studies have shown that
malware writers share code and information by forming physically co-
located teams or virtually through the Internet. Linking the authors who
have cooperated in writing a piece of malware can greatly help analysts
trace modern malware. Therefore, authorship identification techniques must
be able to identify multiple authors from binaries. Existing techniques
cast binary code authorship identification as a supervised machine
learning problem. Given a set of binaries with their author labels as the
training data, these techniques extract stylistic binary code features,
accumulate code features to the whole program level, and represent each
binary in the training set with a feature vector and an author label.
Supervised machine learning algorithms are then used to learn the
correlations between code features and author labels and make predictions
about the author of a new binary. However, these existing techniques assume
that each program binary is written by a single author, so they can only
identify at most one of the multiple authors or report a merged group
identity, making it difficult to discover cooperating-author links.
We present new fine-grained techniques to address the tougher problem of
determining the author of each basic block. The decision of attributing
program authors at the basic block level is based on an empirical study
of three large open source software projects, in which we find that a large
fraction of basic blocks can be well attributed to a single author. We
present new code features that capture programming style at the basic
block level, our handling of inlined code from external template
libraries, and new machine learning models to capture correlations
between the authors of basic blocks in a binary. Our evaluation shows
that our new technique can discriminate the author of a basic block with
65% accuracy among 284 authors, compared to 0.4% accuracy by random guess.
As a proof of concept, we demonstrate using our new techniques to extract
cooperating-author links. We can recover links with 0.78 precision, 0.88
recall, and 0.83 F1-measure. In summary, our new techniques provide a
practical solution for identifying multiple authors in software and
extracting connections between authors from binaries.
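
As a quick sanity check on the figures quoted in the abstract, the short
Python sketch below recomputes the random-guess baseline for 284 authors
and the F1-measure from the reported precision and recall. The function
names are ours and purely illustrative; nothing here comes from the
paper's code, and the baseline assumes a uniform guess over all authors.

```python
def random_guess_accuracy(num_authors: int) -> float:
    """Expected accuracy when guessing uniformly among num_authors (assumed baseline)."""
    return 1.0 / num_authors


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 284 candidate authors, as stated in the abstract.
    print(f"random guess: {random_guess_accuracy(284):.1%}")
    # Precision 0.78 and recall 0.88, as stated in the abstract.
    print(f"F1-measure:   {f1_score(0.78, 0.88):.2f}")
```

Both values agree with the abstract after rounding: about 0.4% for the
random guess and 0.83 for the F1-measure.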