[DynInst_API:] New tech report on binary code authorship


Date: Wed, 23 Nov 2016 10:44:37 -0600
From: Barton Miller <bart@xxxxxxxxxxx>
Subject: [DynInst_API:] New tech report on binary code authorship
The Paradyn project has a new technical report in the area of forensics of
binary machine code.  This is cool new work on detecting authorship in binary
code.

Our full list of project publications can always be found at:
 http://www.paradyn.org/html/publications-by-year.html

Comments and feedback on our papers are always welcome!

------------------------------------------------------------------------------
"Identifying Multiple Authors in a Binary Program",
Xiaozhu Meng, Barton P. Miller, and Kwang-Sung Jun
Submitted for publication, November 2016.
ftp://ftp.cs.wisc.edu/paradyn/papers/Meng16MultiAuthor.pdf

Abstract:
   Binary code authorship identification is the task of determining program
   authors from binary code; it has significant application to forensics of
   malicious software (malware), software supply chain risk management, and
   software plagiarism detection. Modern software, including malware, often
   contains code from multiple authors. Recent studies have shown that
   malware writers share code and information by forming physically
   co-located teams or virtually through the Internet. Linking the authors who
   have cooperated in writing a piece of malware can greatly help analysts
   trace modern malware. Therefore, authorship identification techniques must
   be able to identify multiple authors from binaries. Existing techniques
   cast binary code authorship identification as a supervised machine
   learning problem. Given a set of binaries with their author labels as the
   training data, these techniques extract stylistic binary code features,
   accumulate code features to the whole program level, and represent each
   binary in the training set with a feature vector and an author label.
   Supervised machine learning algorithms are then used to learn the
   correlations between code features and author labels and make predictions
   about the author of a new binary. However, these existing techniques assume
   that each program binary is written by a single author, so they can
   identify at most one of the multiple authors or report a merged group
   identity, making it difficult to discover cooperating-author links.
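
The whole-program pipeline described above can be illustrated with a small
sketch. This is not the method from the report or from any cited system: the
instruction-mnemonic features, the nearest-centroid classifier, the author
names, and all function names are hypothetical stand-ins chosen only to show
why accumulating features to the program level yields a single label per
binary.

```python
from collections import Counter
import math

# ASSUMPTION: each basic block is modeled as a list of instruction mnemonics.
# Real stylistic features are richer; this is only a stand-in.

def program_vector(blocks):
    """Accumulate per-block feature counts into one whole-program vector."""
    total = Counter()
    for block_features in blocks:
        total.update(block_features)
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_binaries):
    """Sum the program vectors of each author's training binaries."""
    centroids = {}
    for blocks, author in labeled_binaries:
        centroids.setdefault(author, Counter()).update(program_vector(blocks))
    return centroids

def predict(centroids, blocks):
    """Return the single author whose centroid is most similar."""
    vec = program_vector(blocks)
    return max(centroids, key=lambda author: cosine(centroids[author], vec))

# Toy training set: one binary per (hypothetical) author.
training = [
    ([["push", "mov", "call"], ["mov", "mov", "ret"]], "alice"),
    ([["xor", "lea", "jmp"], ["lea", "cmp", "jmp"]], "bob"),
]
centroids = train(training)

# A binary whose first two blocks resemble alice and whose last resembles bob.
mixed = [["mov", "mov", "call"], ["mov", "ret"], ["xor", "lea", "jmp"]]
predicted = predict(centroids, mixed)
print(predicted)
```

Because the per-block features are summed before classification, the mixed
binary receives one label and the second contributor is lost, which is the
limitation the paragraph above points out.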

   We present new fine-grained techniques to address the tougher problem of
   determining the author of each basic block. The decision to attribute
   authorship at the basic block level is based on an empirical study
   of three large open source software projects, in which we find that a large
   fraction of basic blocks can be well attributed to a single author. We
   present new code features that capture programming style at the basic
   block level, our handling of inlined code from external template
   libraries, and new machine learning models to capture correlations
   between the authors of basic blocks in a binary. Our evaluation shows
   that our new technique can discriminate the author of a basic block with
   65% accuracy among 284 authors, compared to 0.4% accuracy by random guess.
   As a proof of concept, we demonstrate using our new techniques to extract
   cooperating-author links. We can recover links with 0.78 precision, 0.88
   recall, and 0.83 F1-measure. In summary, our new techniques provide a
   practical solution for identifying multiple authors in software and
   extracting connections between authors from binaries.
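
As a quick consistency check on the figures quoted in the abstract, the
sketch below recomputes the random-guess baseline and the F1-measure. It
assumes a uniform guess over the 284 authors and the standard definition of
F1 as the harmonic mean of precision and recall; the function names are
illustrative and do not come from the report.

```python
def random_guess_accuracy(num_authors):
    """Expected accuracy of a uniform random guess over num_authors labels."""
    return 1.0 / num_authors

def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

baseline = random_guess_accuracy(284)
f1 = f1_measure(0.78, 0.88)

print(f"random-guess accuracy: {baseline:.1%}")  # prints 0.4%
print(f"F1-measure: {f1:.2f}")                   # prints 0.83
```

Both values match the abstract: 1/284 rounds to 0.4%, and a precision of 0.78
with a recall of 0.88 gives an F1-measure of 0.83.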