LEARNING IN EXTREME CONDITIONS: ONLINE AND ACTIVE LEARNING WITH MASSIVE, IMBALANCED AND NOISY DATA

Open Access
Author:
Ertekin, Seyda
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
March 26, 2009
Committee Members:
  • C. Lee Giles, Dissertation Advisor
  • C. Lee Giles, Committee Chair
  • Leon Bottou, Committee Member
  • Raj Acharya, Committee Member
  • Jia Li, Committee Member
  • Tracy Mullen, Committee Member
Keywords:
  • Machine Learning
  • Online Learning
  • Active Learning
  • Support Vector Machines
  • Classification
  • Data Mining
Abstract:
This thesis addresses improving the performance of machine learning algorithms, with a particular focus on classification tasks over large, imbalanced, and noisy datasets. The field of machine learning addresses the question of how best to use experimental or historical data to discover general patterns and regularities and to improve the process of decision making. However, applying machine learning algorithms to very large scale problems still poses challenges, and class imbalance and noise in the data further degrade the prediction accuracy of standard algorithms. The main focus of this thesis is the design of machine learning algorithms and approaches that are faster, more data efficient, and less demanding of computational resources, yielding scalable algorithms for large scale problems. This thesis addresses these problems in active and online learning frameworks. The particular focus is on the Support Vector Machine (SVM) algorithm for classification problems, but the proposed approaches to active and online learning extend readily to other widely used machine learning algorithms.

This thesis first proposes a fast online Support Vector Machine algorithm (LASVM) that achieves a substantial speed improvement over the classical (batch) SVM and other online SVM algorithms while preserving the classification accuracy of state-of-the-art SVM solvers. The ability to handle streaming data, faster training and recognition, and the reduced memory demand of the online learning setting make SVMs applicable to very large datasets. The effectiveness of LASVM and active learning on real world problems is assessed on the name disambiguation problem in CiteSeer's repository of scientific publications. The thesis then presents an efficient active learning framework that targets the expensive labeling problem in producing training data.
The proposed method yields an efficient querying system and removes the computational barriers to applying active learning to very large scale datasets. We then point out that even when labels are readily available, active learning can still be used to identify the most informative instances in the training data. By applying active sample selection and early stopping in the online SVM, we show that the algorithm can match and even exceed the prediction accuracy of the baseline LASVM setting with random sampling. Our experimental results also reveal that active learning can be a highly effective method for dealing with the class imbalance problem. We further propose a hybrid of oversampling and active learning, an adaptive technique named VIRTUAL, to efficiently resample minority class instances in imbalanced data classification. Finally, we propose a non-convex online SVM algorithm (LASVM-NC) based on the Ramp loss, which strongly suppresses the influence of outliers in noisy datasets. Then, again in the online learning setting, we propose an outlier filtering mechanism (LASVM-I) that approximates non-convex behavior within convex optimization. These two algorithms are built on an online SVM solver (LASVM-G) that leverages the duality gap to obtain more trustworthy intermediate models. Our experimental results show that the proposed approaches yield a more scalable online SVM algorithm with sparser models and lower running time in both the training and recognition phases, without sacrificing generalization performance. In closing, we also point out the relation between non-convex behavior in SVMs and active learning.
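The active sample selection the abstract describes rests on a standard idea: for an SVM-style learner, the most informative unlabeled points are those closest to the current separating hyperplane. The sketch below illustrates that selection rule only; it is not the dissertation's querying system, and the function name and linear-model form (`w`, `b`) are illustrative assumptions.

```python
import numpy as np

def margin_active_select(w, b, X_pool, k):
    """Pick the k pool points closest to the hyperplane w.x + b = 0.

    A small |w.x + b| means the point lies near the margin, where
    obtaining its label is most informative for an SVM-style learner;
    points far from the boundary are already classified confidently.
    """
    scores = np.abs(X_pool @ w + b)   # distance proxy: |decision value|
    return np.argsort(scores)[:k]     # indices of the k smallest
```

In an online setting this rule is applied to each incoming batch, so only near-margin examples are processed further; combined with early stopping, this is how active selection can cut training cost without hurting accuracy.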
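The outlier suppression attributed to LASVM-NC comes from the Ramp loss itself: unlike the convex hinge loss, which grows without bound as a point is misclassified more badly, the Ramp loss flattens beyond a cutoff, so gross outliers stop influencing the solution. A minimal numeric sketch of the two losses, assuming the common parameterization with margin score z and clipping parameter s:

```python
def hinge(z):
    # Classical SVM hinge loss H1(z) = max(0, 1 - z):
    # grows linearly as the margin score z becomes more negative.
    return max(0.0, 1.0 - z)

def ramp(z, s=-1.0):
    # Ramp loss R_s(z) = min(H1(z), 1 - s): the hinge clipped at 1 - s.
    # For z below s the loss is flat, so a badly misclassified outlier
    # contributes no further gradient and no support vector.
    return min(hinge(z), 1.0 - s)
```

For a gross outlier with margin score z = -10, the hinge loss is 11 while the Ramp loss (s = -1) is capped at 2; this bounded influence is what yields the sparser models and lower running time reported above, at the price of a non-convex objective.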