Towards Designing Accurate Detection Methods for Emerging Cyber Threats

Open Access
- Author:
- Yuan, Lun Pin
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 05, 2021
- Committee Members:
- Peng Liu, Co-Chair of Committee
G. Tan, Major Field Member
Sencun Zhu, Chair & Dissertation Advisor
Anna Squicciarini, Outside Unit & Field Member
Guohong Cao, Major Field Member
Chitaranjan Das, Program Head/Chair
- Keywords:
- Anomaly Detection
Cybersecurity
Machine Learning
Deep Learning
Cyberthreats
Malware
- Abstract:
- Emerging cyber threats such as data breaches, data exfiltration, botnets, and ransomware have raised serious concerns about the security of enterprise infrastructures. The root cause of an emerging cyber threat could be newly developed malware or a disgruntled insider; yet, as more and more evasive techniques become available to adversaries, emerging cyber threats have become more automated and more difficult for legacy solutions, such as signature-based detection methods, to identify. To this end, many researchers have been working on novel detection methods for emerging cyber threats, including (1) detection of zero-day malware before day zero, and (2) detection of habitual anomalies, assuming adversarial activities violate habitual patterns. In this dissertation, we study the limitations of existing detection methods and propose three novel detection methods for emerging cyber threats, namely, Lshand, Acobe, and DabLog.
In Lshand (Large Scale Hunting for Android Negative-Day malware) we discuss how to discover malware before day zero, which we refer to as negative-day malware. The challenges include scalability and the fact that malware writers apply detection-evasion and submission-anonymization techniques. Our approach is based on the observation that malware development is a continuous process, and thus malware variants inevitably share certain characteristics throughout their development. Accordingly, Lshand clusters scan reports based on selected features and then performs further analysis on seemingly benign apps that are similar to malware variants. We implemented Lshand and evaluated it on submissions to VirusTotal. Our results show that Lshand is capable of hunting down undiscovered malware at large scale, and our manual analysis and a third-party scanner have confirmed our negative-day malware findings to be malware or grayware.
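The cluster-then-inspect idea described for Lshand can be illustrated with a minimal sketch. It is not the dissertation's implementation: the feature names (`cert_fingerprint`, `package_prefix`), the report fields (`app_id`, `detections`), and the exact-match grouping are all assumptions made for illustration, and they do not reflect VirusTotal's actual report schema or Lshand's real clustering procedure.

```python
from collections import defaultdict

# Hypothetical feature names; not VirusTotal's actual report schema.
SELECTED_FEATURES = ("cert_fingerprint", "package_prefix")


def cluster_reports(reports):
    """Group scan reports whose selected features are identical."""
    clusters = defaultdict(list)
    for report in reports:
        key = tuple(report.get(name) for name in SELECTED_FEATURES)
        clusters[key].append(report)
    return clusters


def negative_day_candidates(reports, min_detections=1):
    """Return IDs of undetected apps that share a cluster with detected malware."""
    candidates = []
    for members in cluster_reports(reports).values():
        has_malware = any(r["detections"] >= min_detections for r in members)
        if has_malware:
            candidates.extend(
                r["app_id"] for r in members if r["detections"] < min_detections
            )
    return candidates


# Toy data (fabricated for illustration only).
reports = [
    {"app_id": "app-a", "cert_fingerprint": "c1", "package_prefix": "com.x", "detections": 12},
    {"app_id": "app-b", "cert_fingerprint": "c1", "package_prefix": "com.x", "detections": 0},
    {"app_id": "app-c", "cert_fingerprint": "c2", "package_prefix": "org.y", "detections": 0},
]
print(negative_day_candidates(reports))  # ['app-b']
```

Here `app-b` looks benign on its own but shares every selected feature with a detected sample, so it is queued for further analysis. Grouping by exact feature equality is only the simplest possible stand-in; a similarity-based clustering would presumably tolerate small differences between variants.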
In Acobe (Anomaly detection method based on COmpound BEhavior) we address a fundamental limitation of anomaly detection methods that profile users based on single-day and individual-user behaviors. We argue that, without capturing long-term signals and group-correlation signals, such models cannot identify low-signal yet long-lasting threats and will wrongly report many normal users as anomalies on busy days, which in turn leads to a high false-positive rate. In contrast, our approach takes long-term patterns and group behaviors into consideration. It leverages a novel behavior representation and an ensemble of deep autoencoders to produce an ordered investigation list. Our evaluation shows that Acobe outperforms prior work by a large margin in terms of precision and recall, and our case study demonstrates that Acobe is applicable in practice for cyberattack detection.
In DabLog (Deep Autoencoder-Based anomaly detection for discrete event Logs) we address a fundamental limitation of widely adopted anomaly detection methods for discrete event logs. The limitation is that, given an observed sequence of events, most earlier work tries to predict upcoming events and raises an anomaly alert when a prediction fails to meet a certain criterion. However, such a predict-next-event methodology may not fully exploit the distinctive characteristics of sequences, and hence may incur many false positives. We argue that it is also critical to examine the structure of sequences and the bi-directional causality among individual events. In contrast to predicting the next event, our approach determines whether a sequence is normal or abnormal by analyzing (encoding) and reconstructing (decoding) the given sequence. Our evaluation results show that this methodology significantly reduces the number of false positives, hence achieving a higher F1 score.
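The difference between predicting the next event and judging an event in the context of the whole sequence can be shown with a toy, count-based sketch. This is a stand-in and not DabLog itself: a frequency table replaces the deep autoencoder, only the immediate left and right neighbors serve as context, and the event names, the `PAD` marker, and the top-`k` criterion are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

PAD = "<pad>"  # hypothetical boundary marker for sequence start and end


def train(normal_sequences):
    """Count which events follow a left neighbor, and which sit between two neighbors."""
    next_counts = defaultdict(Counter)    # prev -> Counter(current)
    middle_counts = defaultdict(Counter)  # (prev, next) -> Counter(current)
    for seq in normal_sequences:
        padded = [PAD] + list(seq) + [PAD]
        for i in range(1, len(padded) - 1):
            prev, cur, nxt = padded[i - 1], padded[i], padded[i + 1]
            next_counts[prev][cur] += 1
            middle_counts[(prev, nxt)][cur] += 1
    return next_counts, middle_counts


def top_k(counter, k):
    """Return the k most frequent events of a Counter as a set."""
    return {event for event, _ in counter.most_common(k)}


def flags_predict_next(seq, next_counts, k=1):
    """Alert if any event is outside the top-k successors of its left neighbor."""
    padded = [PAD] + list(seq)
    return any(
        padded[i] not in top_k(next_counts.get(padded[i - 1], Counter()), k)
        for i in range(1, len(padded))
    )


def flags_bidirectional(seq, middle_counts, k=1):
    """Alert if any event is outside the top-k events seen between both neighbors."""
    padded = [PAD] + list(seq) + [PAD]
    return any(
        padded[i] not in top_k(middle_counts.get((padded[i - 1], padded[i + 1]), Counter()), k)
        for i in range(1, len(padded) - 1)
    )


# Toy training data (fabricated): one frequent and one rare normal pattern.
normal = [["login", "read", "logout"]] * 5 + [["login", "write", "commit", "logout"]] * 2
next_counts, middle_counts = train(normal)

rare_but_normal = ["login", "write", "commit", "logout"]
abnormal = ["login", "write", "logout"]  # write without commit never occurs in training
```

In this contrived case, the left-only check with `k=1` alerts on `rare_but_normal` because `read` is the most frequent successor of `login`, whereas the check that also sees the right-hand neighbor accepts it and still alerts on `abnormal`. It illustrates the motivation only and says nothing about the magnitude of the improvement reported in the evaluation.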