Semisupervised Active Learning and Group Anomaly Detection with Unknown or Label-Scarce Categories

Open Access
Author:
Qiu, Zhicong
Graduate Program:
Electrical Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 30, 2017
Committee Members:
  • David Jonathan Miller, Dissertation Advisor
  • David Jonathan Miller, Committee Chair
  • John F Doherty, Committee Member
  • George Kesidis, Committee Member
  • Sencun Zhu, Outside Member
Keywords:
  • Active learning (AL)
  • cross entropy
  • inductive bias
  • logistic regression
  • p-value
  • rare classes
  • regularization
  • semisupervised learning (SL)
  • group anomaly detection
  • network intrusion detection
  • Gaussian mixture model
  • Bonferroni approximation
Abstract:
This dissertation contributes to two major areas of machine learning, semi-supervised active learning and anomaly detection. The methods are generally applicable and are demonstrated on vehicle tracking and network intrusion detection. In both domains, some categories may be rare or unknown, with very few or no labeled samples to start with. For example, in a network intrusion detection system that follows a standard statistical anomaly detection (AD) approach, one would train a null model (the null hypothesis) characterizing normal behavior, typically on data captured in a sandbox environment, and flag as anomalous any sample whose deviation from the norm exceeds a pre-defined threshold. Naively deploying the null model, however, will flood the administrator with more uninteresting anomalies than can be fully investigated. It therefore makes sense to discriminate the uninteresting anomalies from the truly interesting ones, and to develop an active selection strategy that efficiently forwards for oracle labeling the samples that help learn to discriminate these groups. Moreover, the interesting anomalies, which may amount to zero-day threats, are often highly skewed and may manifest on only a very small subset of features. Using all features to identify anomalies is not advantageous, because many features may be uninformative or non-discriminative in practice. We aim to design a rare category identification and characterization system that benefits from 1) statistical AD, 2) a semi-supervised, active learning based discriminant classifier that gives zero weight to irrelevant features, and 3) an active sample selection strategy that selects the sample most likely to be unknown, thereby efficiently ranking and forwarding interesting anomalies to the administrator. For 1), we also develop a purely unsupervised learning technique to extract group behavior that is jointly anomalous on a subset of samples and a subset of features.
In semi-supervised learning, the fundamental idea is to combine a limited number of labeled training samples with an abundance of unlabeled samples to learn a classifier. We focus on domains where some categories are rare or unknown, with very few or no labeled samples to start with. Conventional approaches in the mainstream literature treat unlabeled samples as belonging to one of the known categories and leverage them to minimize class posterior uncertainty. In contrast, we propose a semi-supervised objective that seeks to preserve the uncertainty among unlabeled samples, both to avoid overtraining and to help efficiently discover unknown classes. Specifically, using Shannon entropy as the measure of uncertainty, we propose maximum entropy regularization (maxEnt) rather than minimum entropy regularization (minEnt). Moreover, for two-class classification our proposed objective is convex (unlike minEnt), and it is shown to outperform existing approaches on a variety of rare category characterization problems. While semi-supervised learning focuses on exploiting the latent structure hidden in the sample distribution, active learning tries to explore the regions of the sample space with the least representation. We combine our maxEnt semi-supervised learning with a novel active learning strategy; together they efficiently draw from the pool of unlabeled samples for oracle labeling, to achieve the best class discovery (exploration) and classification (exploitation).
In anomaly detection, the fundamental idea is to detect significant outliers that deviate from the one-class model, or null hypothesis. We make two contributions. First, we propose a group-based anomaly detection scheme that identifies the sample subset and feature subset that jointly manifest a potential anomalous group, with a Bonferroni approximation used to account for multiple testing.
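As a rough illustration of the Bonferroni step mentioned above, the following Python sketch scales a raw p-value by the number of tests performed and caps the result at 1. The abstract does not say how the dissertation counts the tests. The count used here (every choice of a sample subset and a feature subset of fixed sizes) and the names `bonferroni_adjust` and `num_candidate_groups` are assumptions made only for this example.

```python
from math import comb


def bonferroni_adjust(p_value: float, num_tests: int) -> float:
    """Bonferroni-corrected p-value: multiply by the number of tests, cap at 1."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return min(1.0, p_value * num_tests)


def num_candidate_groups(n_samples: int, group_size: int,
                         n_features: int, subset_size: int) -> int:
    """Assumed test count: one test per (sample subset, feature subset) pair."""
    return comb(n_samples, group_size) * comb(n_features, subset_size)


# Hypothetical numbers: 5 samples, groups of 2, 4 features, subsets of 1.
tests = num_candidate_groups(5, 2, 4, 1)      # C(5, 2) * C(4, 1) = 10 * 4 = 40
adjusted = bonferroni_adjust(1e-4, tests)     # 1e-4 * 40 = 4e-3
```

The correction grows quickly with the number of candidate groups, so a group is only reported as significant when its raw p-value is small relative to the size of the search space.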
Unlike point-wise anomaly detectors, our model has the potential to jointly identify which sample and feature subsets are most atypical with respect to the null, thus avoiding most superficial point-wise outliers and efficiently capturing group anomalies. Second, we propose singleton and pairwise Gaussian mixture models (GMMs) as a novel feature representation that characterizes atypicality on each single dimension or pair of dimensions, using the p-value as the score or feature. This avoids the curse of dimensionality and achieves superior performance compared to other, non-informative feature mapping techniques in the literature.
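To make the idea of a p-value feature on a single dimension concrete, here is a minimal Python sketch for a one-dimensional GMM whose parameters are already known. The abstract does not define the p-value used in the dissertation. The definition below (each component's two-sided Gaussian tail probability, weighted by that component's posterior probability given the observation) is one simple choice assumed for illustration, and the function names are hypothetical.

```python
from math import erfc, exp, pi, sqrt


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2.0 * pi))


def component_p_value(x: float, mean: float, std: float) -> float:
    """Two-sided tail probability: chance of lying at least as far from the mean as x."""
    return erfc(abs(x - mean) / (std * sqrt(2.0)))


def mixture_p_value(x, weights, means, stds):
    """Assumed definition: posterior-weighted average of component p-values."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    if total == 0.0:
        # Density underflowed: x is extremely atypical under every component.
        return 0.0
    return sum((j / total) * component_p_value(x, m, s)
               for j, m, s in zip(joint, means, stds))


# Hypothetical two-component null model for one feature.
weights, means, stds = [0.7, 0.3], [0.0, 5.0], [1.0, 0.5]
typical = mixture_p_value(0.2, weights, means, stds)    # near a component mean
atypical = mixture_p_value(2.8, weights, means, stds)   # between the components
```

Applying such a function to every feature (and, with a bivariate analogue, to every feature pair) would map a sample to a vector of p-values, which is the kind of low-dimensional atypicality representation the abstract describes.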