[pl-seminar] CS703 Project presentations Dec 6/8/11 1-2pm Compsci 1263


Date: Tue, 05 Dec 2017 18:59:31 -0600
From: "Loris D'Antoni" <loris@xxxxxxxxxxx>
Subject: [pl-seminar] CS703 Project presentations Dec 6/8/11 1-2pm Compsci 1263
Wednesday 12/6

Learning from Programs Through Parameterized Matrix Embeddings (Jordan Henkel)
Manually modelling program behavior is a costly and error-prone endeavor. As a result of this many critical projects, especially open source projects where contributors are often already overburdened, cannot reap the benefits of complex static analyses.
To solve this problem we propose a "push-button" system to learn models of program behavior by leveraging a large corpus of related open source software. Specifically, we find a way to embed symbolic traces (abstracted by a set of parameterized predicates) into a space of Boolean matrices. Using this embedding we can apply off-the-shelf machine learning techniques to produce interesting results.

Code Language Model with Subtoken Information (Jinman Zhao)
With the naturalness hypothesis that software corpora share similar statistical properties with natural language corpora, one may exploit models and techniques from machine learning and natural language processing to build tools that benefit software engineering. At the core often lies a code language model as a way to capture naturalness and a gateway to applications such as code generation, code anomaly detectionÂandÂcode pattern mining. A code language model is a probability distribution over all possible occurrences of code that models howÂcodeÂis generated. Despite many works utilizing language model for code, almost all of them view tokens as atoms with no inner structure, which ignore the potentially rich information at lexical level. Furthermore, the ominous presence of neologisms such as variables and methods intensifies the problem of rare or out-of-vocabulary tokens in language modeling. In this work, we seek to enhance code language model with subtoken information such as character-level n-gram and subtoken vocabulary. We train our model over more than 100k of Python function definitions and evaluate the learned model with perplexity and downstream applications such as code completion and anomaly detection.

Synthesis of Temporal Gene Regulatory Networks (Aaron Baker)
We live in exciting times for biological research. The completion of the Human Genome Project in the year 2000 promised the ability to connect a disease with its genetic causes. However, after many years of research into these associations, it appears many seemingly unrelated genes are weakly responsible for disease [1]. That is, disease phenotypes are caused by complicated networks of interactions among genes. A popular example of this phenomenon is the lac operon, a system of genes in E. coli and other bacteria. When lactose is present in a bacterium's environment, genes responsible for digesting the lactose are turned on. Understanding these types of interactions is crucial for researchers to realize the original promise of the Human Genome Project. Interactions may be described in terms of logical formulae relating the _expression_ of an input gene on its target. We can learn these interactions through synthesis of these formulae. In this work, I present an adaption of synthesis techniques originally used to understand the development of the sea urchin [2] to confirm that these techniques can recapitulate our understanding of the lac operon. From there, I show how synthesize novel formulae describing previously unreported genetic interactions.Â

Attack Analysis on Ethereum Smart Contracts (Zhicheng Cai)
Blockchain techniques allow untrusted individuals to achieve a consensus on some state machine. While Bitcoin supports a decentralized cryptocurrency, Ethereum platform provides a way for everyone to create and execute Turing-complete DApps, which we call smart contracts. These DApps can be written to create peopleâs own arbitrary rules for ownership, transaction formats and state transaction functions, and just play as âautonomous agentsâ, handling billions of dollars on Ethereum blockchain. Unfortunately, there has been millions of dollars stolen, due to the security issues in smart contracts. Detecting bugs and writing secure contracts is of much importance, but the correctness of executions alone is not sufficient to make smart contracts secure. This work classifies 3 main categories, 10 detailed types of attacks and treats each concrete attack example as a VAT tuple â (Victim, Attacker, Trigging scripts). The victims and attackers contracts are written in Solidity, the most popular high-level language for Ethereum smart contracts, and the trigging scripts are written in _javascript_. Our VAT tuples can be configured and run on Truffle testing platform, which recently supports debugging for smart contracts execution, along with local Ethereum TestRPC. Future work may extract the abstraction of attacks from more concrete examples and help to detect possible vulnerabilities of smart contracts.

Friday 12/8

Explaining the Behaviors of an ML Model via Formal Language Learning (David Merrell)
Despite their popularity and commercial success, neural networks remain opaque to human users. Even with access to a neural network's architecture and parameters, it is difficult to gain any insight or explanation of "why" a network makes its decisions. In this work, we set our sights on a Recurrent Neural Network model and attempt to use methods from formal language learning to produce human-intepretable explanations of its decisions. Specifically, we attempt to explain the behavior of SLANG, an automatic code-completion framework (Raychev et al., 2014).

Refinement of Program Repairing via Direct State Manipulation (Rong Pan, Yuanfang Song)
Different from existing program repair frameworks, the specification mechanism called direct state manipulation allows a user to visualize the trace of a buggy program and specify intended behaviors by manipulating certain variableâs value at some location. Under this mechanism, we extend the definition of distance between programs for the recursive and interprocedural case and improve the ability to find solutions for repair tool JDial.
For the scenario that the manipulated variable involves a large value, we rewrite each assignment as a sequence of assignments, where each assigned value modulo a prime number. Another improvement is that we manage to repair recursive function and function which invokes another function, where the invoked function might also be a recursive function, and the repair may happen in any related function. The main adaptation for calculating the distance between programs is to consider execution order of each line of code to derive edit distance between two orders and use a stack to keep track of nested level. For the improvements mentioned above, We get desired outputs for representative benchmarks.

- Recursive Program Synthesis using Java (Bob Effinger)
Abstract. The process of using input and output example to recursively synthesize programs was outlined in a paper called Recursive Program Synthesis by Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. For my project I attempted to reproduce their program called ESCHER in Java. In addition to the input examples, the instructions that can be used in the generated programs must also be specified. My goal for this project was to become more familiar with synthesis methods, and the largest method that I was used in the synthesis was enumeration. With this paper I intend to demonstrate the capabilities of my program, and explain some of the problems that I ran into while implementing the algorithm. In addition I will show some elements that I left out due to time restrictions, and a few things that I would have done differently if I was to rebuild the project from the ground up.

Monday 12/11

Reactive Synthesis of LTL liveness properties in SDN (Hsuan-Heng Wu)
Liveness Properties are useful in the context of SDN to model behaviors such as intrusion detection (A malicious client is eventually blocked) or reason about causal relationship. (A request eventually leads to some result). While there are plenty of works on safety properties in SDNs, liveness properties of SDN programs is not as well a well-studied area. What Iâm trying to achieve in my project is to see whether we can apply the reactive synthesis technique to synthesize SDN programs for liveness properties specified in LTL under the one-big-switch SDN model. One big switch SDN model is a model that simplifies the distributed control planes as a single switch with a table of forwarding rules, where each rule can be
activated/deactivated based on controller policy. One big switch model is a simple but expressive model that is useful for network analysis, thus making it an interesting target. To model this problem as a reactive synthesis problem, we essentially want to come up with a controller where the environment is in-coming packets, and the controller gets to decide how forwarding rules are updated according to the incoming packets such that there wonât be any violation of specified properties.

- PatchEmbedding: towards the understanding of similarity of File System Patch (Jing Liu)
File systemsÂplay a crucial role in operating system and serves as the foundation of most of the modern distributed systems, however, hard to ensure the correctness due to its complexity. In this, it is important to make it easy to fix the issues of the deficient file system. This work, based on a dataset of 7-year linux file system patch, would like to define the distance of different file system patches. Thus, developers would be able to easily find the related patches while analyzing a problem and trying to fix the bugs. The key idea is that, we regard the patches and the huge file system codebase as a more restricted 'natural language system'. Each patch, is composed of several sentences. Specifically, one changed block and its +++, --- will form two different sentences (aka. before the patch and patched code). And the word in this language system will be the variables and datatype as well as operations in C programming language. The patch is natural to this model, since it provides the context to make it like sentences from the intuition. Then we use word2vec model to model the words in this language system into vectors. To derive the similarity of patches, we use the idea that each patch will be a Bow (bag of word), and will in some distribution, then based on the metric of word2vec result, use the EMD distance model to solve the distance between two different patches. (From word embedding to document distances, ICML15). To validate this similarity definition, we randomly select some of the patches, and then use the annotated tags to verify. The result shows that this two distances have some co-relation.

Partial Program Synthesis Using a Generative Probability Model (Moayad Alnammi)
In this project, we investigate a procedural statistical technique for generating partial programs of string manipulation problems guided by intent extracted from input-output examples. We formalize the problem of predicting a partial program given intent as a maximum-likelihood-estimate (MLE) problem. Each partial program is encoded as a fixed length vector for consumption by a Machine Learning model. The formalization defines the distributions and dependency among the elements in the encoding. This dependency allows us to procedurally generate partial program tokens one after the other given the previously generated tokens. We use a dataset of partial programs along with their input-output example intents to train two models (simple vs. complex). Evaluation is performed on a held-out dataset to determine the efficacy of this generative technique. We identify the limitations and opportunities for future work using this method.
[← Prev in Thread] Current Thread [Next in Thread→]