Toward Obfuscation-resilient Plagiarism Detection

Open Access
Zhang, Fangfang
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
March 31, 2014
Committee Members:
  • Sencun Zhu, Dissertation Advisor
  • Peng Liu, Committee Member
  • Guohong Cao, Committee Member
  • David Miller, Committee Member
Keywords:
  • Plagiarism detection
  • Smartphone app repackaging
  • Program birthmark
  • Obfuscation
In the field of software development, plagiarism is a violation of intellectual property rights. Plagiarists either illegally copy others' source or binary code (software plagiarism) or steal others' algorithms and covertly implement them (algorithm plagiarism). Plagiarists often apply code obfuscation techniques to evade detection. Plagiarism has become a serious concern for honest software companies and the open source community. In addition, with the wide use of mobile devices such as smartphones and tablets and the rapid growth of mobile application (app) markets, mobile app repackaging has emerged as a new kind of software plagiarism. It harms not only the health of app markets but also the security of mobile users. As a result, computer-aided, automated plagiarism detection is needed.

There are two common requirements for a good plagiarism detection scheme: (R1) the capability to work on suspicious executables without the source code; (R2) resiliency to code obfuscation techniques. In this dissertation, we propose an obfuscation-resilient plagiarism detection architecture that satisfies both requirements. It contains three components: LoPD, a program logic-based approach to software plagiarism detection; ValPD, a dynamic value-based approach to algorithm plagiarism detection; and ViewDroid, a user interface-based approach to Android application repackaging detection.

Instead of directly measuring the similarity between two programs, LoPD searches for any dissimilarity between them by looking for an input that causes the two programs to behave differently, either with different output states or with semantically different execution paths. If even one such dissimilarity is found, the programs are semantically different; otherwise, the pair is likely a plagiarism case.
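As a rough illustration of the dissimilarity search described above, the following Python sketch looks for a distinguishing input by brute-force enumeration over a small integer domain. This is a stand-in chosen for brevity, not the symbolic execution and weakest precondition reasoning on which LoPD relies, and it compares output values only, not execution paths. The names `find_dissimilarity`, `original`, `obfuscated_copy`, and `independent` are hypothetical and do not come from the dissertation.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def find_dissimilarity(
    p: Callable[..., int],
    q: Callable[..., int],
    domain: Iterable[int],
    arity: int,
) -> Optional[Tuple[int, ...]]:
    """Return the first input on which p and q disagree, or None if none is found.

    Assumption: exhaustive enumeration of a tiny domain stands in for a real
    semantic analysis; it says nothing about inputs outside the domain.
    """
    for args in product(list(domain), repeat=arity):
        if p(*args) != q(*args):
            return args
    return None


def original(a: int, b: int) -> int:
    return a * 2 + b


def obfuscated_copy(a: int, b: int) -> int:
    # Same semantics, different syntax: an opaque predicate plus a rewritten
    # arithmetic expression. a*a + a = a*(a+1) is even for every integer a.
    if (a * a + a) % 2 == 0:
        return (a << 1) + b
    return -1  # dead branch, never taken


def independent(a: int, b: int) -> int:
    # A genuinely different computation.
    return a + b * 2


SMALL_INTS = range(-4, 5)

copy_witness = find_dissimilarity(original, obfuscated_copy, SMALL_INTS, 2)
other_witness = find_dissimilarity(original, independent, SMALL_INTS, 2)
```

In this toy setting no distinguishing input is found for the obfuscated copy, so the pair would be flagged for closer inspection, while the independent program is separated by a concrete witness input.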
We leverage symbolic execution and weakest precondition reasoning to capture the semantics of execution paths and to find path dissimilarities. LoPD is resilient to current automatic obfuscation techniques. In addition, since LoPD is a formal program semantics-based method, we can provide a formal guarantee of resilience against most known obfuscation attacks. Our evaluation results indicate that LoPD is both effective and efficient in detecting software plagiarism.

In the ValPD component, we propose two dynamic value-based approaches to algorithm plagiarism detection, namely N-version and annotation. Our approaches are motivated by the observation that there exist critical runtime values that can be neither replaced nor eliminated in any implementation of the same algorithm. The N-version approach extracts such core values by filtering out non-core values. The annotation approach leverages auxiliary information to flag important variables that contain core values. We also propose a value dependence graph-based similarity metric to address the potential value reordering attack. We implemented and evaluated a prototype. The results show that our approaches to algorithm plagiarism detection are practical, effective, and resilient to many automatic obfuscation techniques.

Lastly, we propose ViewDroid, a user interface-based approach to smartphone application repackaging detection. Android applications are user interaction intensive and event dominated; the interactions between users and apps are performed through the user interface (i.e., views). This inspired the design of our new birthmark for Android applications, the feature view graph, which captures users' navigation behavior across application views. Our experimental results demonstrate that this birthmark characterizes Android applications at a higher level of abstraction, making it resilient to code obfuscation. It can detect repackaged apps in large-scale scenarios both effectively and efficiently.
Manual verification of the reported pairs shows that ViewDroid's false positive and false negative rates are very low.
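To make the idea of a view-navigation birthmark more concrete, here is a minimal Python sketch that models each app as a set of navigation edges between views and compares two apps by the Jaccard similarity of those edge sets. The abstract does not say how feature view graphs are compared, so the Jaccard measure is an assumption made purely for illustration, as are the view labels, event names, and the function `edge_jaccard`. The sketch also assumes that views are labeled by name-independent roles, so that renaming classes would not change the labels.

```python
from typing import FrozenSet, Tuple

# One navigation edge: (source view, triggering event, destination view).
# All labels below are hypothetical examples, not data from the dissertation.
Edge = Tuple[str, str, str]


def edge_jaccard(g1: FrozenSet[Edge], g2: FrozenSet[Edge]) -> float:
    """Jaccard similarity of two edge sets (assumed metric, for illustration)."""
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)


app: FrozenSet[Edge] = frozenset({
    ("Login", "onClick", "Home"),
    ("Home", "onClick", "Settings"),
    ("Home", "onItemClick", "Detail"),
})

# A repackaged copy keeps the navigation structure and adds one extra view.
repackaged: FrozenSet[Edge] = app | frozenset({("Home", "onClick", "AdBanner")})

# An unrelated app shares no navigation edges with the original.
unrelated: FrozenSet[Edge] = frozenset({
    ("Splash", "onClick", "Game"),
    ("Game", "onClick", "Scores"),
})

similar_score = edge_jaccard(app, repackaged)
different_score = edge_jaccard(app, unrelated)
```

Because the comparison operates on how views connect and not on the code behind them, a score of this kind stays high when only the implementation is obfuscated, which is the intuition behind a user interface-level birthmark.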