Deep Learning for Security-oriented Program Analysis

Restricted (Penn State Only)
- Author:
- Wang, Zhilong
- Graduate Program:
- Informatics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 20, 2023
- Committee Members:
- Jeffrey Bardzell, Program Head/Chair
- Peng Liu, Chair & Dissertation Advisor
- Suhang Wang, Major Field Member
- Sencun Zhu, Outside Unit & Field Member
- Amulya Yadav, Major Field Member
- Keywords:
- Deep Learning
- Software Security
- Program Analysis
- Binary Analysis
- Abstract:
- Deep learning methods have revolutionized the fields of Natural Language Processing and Computer Vision with their exceptional capabilities. This success has drawn the attention of security researchers, who have begun to explore the potential of deep learning for addressing security problems. Recently, many works have applied deep learning to security-centric program analysis tasks such as reverse engineering and code similarity detection. However, given the complex structure and dependency relationships inside programs, the popular models in related research often recognize only superficial features of binary, and even source, code, and fall short of capturing high-level semantics. In this thesis, we examine the limitations of mainstream models (including RNN, CNN, and BERT) when applied to program analysis. We then explore the potential of deep learning to understand and analyze the high-level semantics of binary-only programs when appropriate neural model architectures and features are selected. The thesis tackles three selected security challenges that require comprehension of a program's high-level semantics and attempts to address them using deep learning-based approaches.
The first challenge pertains to algorithm inference in Reverse Engineering (RE). RE is a critical task performed by security professionals for various purposes. However, the complexity of ransomware and the laboriousness of reversing it have posed significant challenges to experts in the field. In response, this study explores the feasibility of incorporating deep learning techniques to assist ransomware RE. To tackle the specific challenges of encryption loop localization, our approach employs two learning strategies. First, we identify and utilize code-obfuscation-resilient and encryption-algorithm-agnostic features, including $K$-complexity and operations that yield equiprobable outputs. Second, we carefully select a neural network architecture capable of extracting informative features. By validating the effectiveness of our approach in automatically recognizing encryption code during ransomware RE, this study demonstrates the feasibility of deep-learning-assisted semantic-level RE.
The second challenge concerns the identification of security-critical non-control variables in software protection. As control-flow protection methods become widely deployed, it becomes difficult for attackers to corrupt control data to build attacks. Instead, data-oriented exploits, which modify non-control data for malicious goals, have been demonstrated to be possible and powerful. To defend against data-oriented exploits, the first fundamental step is to identify non-control, security-critical data. In this work, we investigate the application of deep learning to critical-data identification. This work provides an in-depth understanding of how to effectively learn data and control dependence features from the dynamic execution trace, and a detailed explanation of why many baseline applications of deep learning fail to solve this problem.
The third challenge focuses on silent buffer overflow detection in vulnerability discovery and analysis. A software vulnerability can be exploited without any visible symptoms. Although such silent program executions can cause very serious damage, analyzing silent yet harmful executions is still an open problem when no source code is available. In this work, we propose a graph-neural-network-assisted data flow analysis method for spotting silent buffer overflows in execution traces. The new method combines a novel graph structure (denoted DFG+) that goes beyond data-flow graphs with a modified Relational Graph Convolutional Network as the GNN model to be trained. The evaluation results show that a well-trained model can be used to analyze vulnerabilities in execution traces (of previously unseen programs) without the support of any source code.
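The $K$-complexity feature named for the first challenge can be illustrated with a minimal sketch. It assumes that $K$-complexity refers to Kolmogorov complexity, which is uncomputable, and substitutes a common stand-in: the zlib compression ratio of a byte buffer. The function name `compression_ratio` and the sample buffers are hypothetical and are not taken from the thesis. The point is only that encrypted-looking output is nearly incompressible, whatever cipher produced it.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size.

    Used here as a rough, assumed proxy for Kolmogorov complexity:
    values near (or slightly above) 1.0 mean the buffer is close to
    incompressible, as ciphertext typically is.
    """
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


# Highly regular plaintext: compresses very well.
plaintext = b"attack at dawn. " * 256

# Random bytes stand in for ciphertext (an assumption for illustration).
ciphertext_like = os.urandom(len(plaintext))

print(f"plaintext ratio:       {compression_ratio(plaintext):.3f}")
print(f"ciphertext-like ratio: {compression_ratio(ciphertext_like):.3f}")
```

A feature of this kind does not depend on which encryption algorithm is in use, which is why it fits the "encryption-algorithm-agnostic" description; how the thesis actually computes or feeds such a feature to the network is not stated in the abstract.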