A Framework for Mining Significant Subgraphs and Its Application in Malware Analysis

Open Access
Palahan, Sirinda
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
April 04, 2014
Committee Members:
  • Daniel Kifer, Dissertation Advisor
  • Robert Collins, Committee Member
  • John Joseph Hannan, Committee Member
  • C Lee Giles, Committee Member
  • Lee David Coraor, Committee Member
  • Significant subgraph
  • subgraph mining
  • malware behavior
  • malware detection
The growth of graph data has encouraged research in graph mining algorithms, especially subgraph pattern mining from graph databases. Discovered patterns could help researchers understand inherent properties and characteristics in large and complex graphs. Frequent subgraph mining has been widely applied successfully in many applications, such as mining network motifs in a complex network, identifying malicious behaviors or mining biochemical structures. However, the high frequency of a subgraph does not always indicate that a subgraph is statistically significant. In this dissertation, we propose a framework for mining statistically significant subgraphs. Our framework is based on a new method for measuring the statistical significance of subgraphs. Given a training set of graphs from two classes (e.g., positive and negative graphs), our method utilizes the class labels provided in the training data to calculate p-values. The p-values reflect how significant the subgraphs are in one class with respect to a null distribution. Our method can assign p-values to subgraphs of new graph instances even if those subgraphs have not appeared before in the training data. We apply our framework to malware analysis where we extract malicious behaviors from malware executables and calculate their p-values. We focus on this problem because malware is still a serious threat to our society. Traditionally, analysis of malicious software is only a semi-automated process, often requiring a skilled human analyst. As new malware appears at an increasingly alarming rate, now over 100,000 new variants each day, there is a need for automated techniques for identifying suspicious behavior in programs. The contribution of this dissertation is two-fold. (1) We propose a framework for extracting statistically significant subgraphs and apply the framework to identify significant behaviors from malware samples. (2) We develop a methodology for evaluating the quality of significant malware behaviors. The experimental results showed that our framework was able to identify behaviors that are both statistically significant and malicious based on a description by the malware expert. The results also showed that our framework could possibly able to detect unseen behaviors not previously seen in the training dataset.