Applying machine learning and NLP techniques to Cyber Security: three selected studies
Restricted (Penn State Only)
- Author:
- Zhang, Lan
- Graduate Program:
- Informatics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- September 18, 2023
- Committee Members:
- Jeffrey Bardzell, Program Head/Chair
Peng Liu, Chair & Dissertation Advisor
Minghui Zhu, Outside Unit & Field Member
Suhang Wang, Major Field Member
Taegyu Kim, Major Field Member
- Keywords:
- Malware detection
firmware emulation
deep learning
- Abstract:
- As technology advances rapidly, cyber attacks have grown significantly in both frequency and severity. To counter this trend, artificial intelligence (AI) has increasingly been integrated into cyber defenses in recent years. This thesis presents three research efforts at the intersection of AI and cybersecurity.
The first part of this thesis examines machine learning (ML)-facilitated attacks against graph-based malware detection models. Although ML-powered malware scanners, particularly those built on graph-based detection models, now complement traditional scanning methods, the vulnerability of these models to attack has not been systematically explored. Current techniques for generating adversarial examples often fail to preserve the semantics of the original malware. This research proposes an attack strategy that targets graph-based models while keeping the malware's functionality intact. Using reinforcement learning, the approach inserts semantic no-operation instructions (nops) into basic blocks of the original control flow graphs (CFGs), thereby changing the model's output while preserving the core functionality of the malware.
The second part of this thesis introduces an NLP-driven approach to firmware analysis. Evaluating the security of Internet of Things (IoT) devices requires testing firmware on actual microcontroller units (MCUs). However, relying on real peripherals limits scalability, so various emulation-based techniques have emerged. These techniques typically infer peripheral behavior from the firmware itself, yet firmware content can be incomplete or misleading. In response, a technique is presented that extracts condition-action pairs from MCU manuals using natural language processing. The extracted knowledge is then used to build a real-time emulator on top of S2E, enabling automated firmware testing.
The final part of this thesis introduces an NLP-driven method for dynamically identifying concurrency bugs in embedded systems. The complexity of embedded systems, particularly interrupt-level concurrency, makes reliability and robustness hard to ensure. Existing methods often lack awareness of the underlying hardware state and may therefore trigger interrupts erroneously. To address this, a hybrid approach is proposed that combines NLP-driven signal extraction with dynamic validation.