Extreme Point Approach to Subset Selection and Data Augmentation

Open Access
- Author:
- Banagere Manjunatha, Srikanth
- Graduate Program:
- Electrical Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- November 02, 2021
- Committee Members:
- Kultegin Aydin, Program Head/Chair
Viveck Ramesh Cadambe, Thesis Advisor/Co-Advisor
Mehrdad Mahdavi, Committee Member - Keywords:
- Machine Learning
Data Augmentation
Convex hull
Isometric property
Classification task - Abstract:
- In the recent years, we have witnessed a drastic increase in the dataset size across various disciplines. It is challenging to store and process this large data. Along with the lack of resources, the existing algorithms are proven to be computationally very expensive as this large data is stored on multiple clusters of machines. Although there are many new algorithms developed focusing on tweaking the architecture of deep neural networks to address the problem of massive data explosion, we focus on a more general problem. We develop a method to obtain a representative subset of the training data, such that the convex hulls of various classes are approximately preserved. We mainly focus on classification problems in machine learning. We first study linear classification models, and then extend our ideas to non-linear classifiers. We develop a randomized extreme point algorithm that approximates the extreme points of a given set of points in high dimensions. We show that the performance of the model on the subsets found using this randomized algorithm are competitive with the performance found on the full dataset. Specifically, for a set of N data points in Rd, our algorithm has a computational complexity of O(MN2) independent of the dimension d. We extend our approach to develop efficient methods for data augmentation part by adding random noise. Data augmentation has proven to make the margin of classification thick, hence making the model more robust. Current methods augment the entire dataset randomly in an isotropic manner. We explore the domain with a focus on augmenting only the extreme points. Specifically, we portray methods to augment the extreme points in specific direction, such that the convex hull of the augmented points along with the extreme points contains the whole dataset. The specific directions are carefully chosen such that the augmented points do not lie within the convex hull of extreme points. Towards this, we demonstrate a further reduction in size while still achieving a similar performance as the solution found on the full dataset. We further extend our ideas to non linear separable dataset, and demonstrate the interpretability property of the chosen extreme points. We demonstrate the effectiveness of our approach for both training and data augmentation on the standard MNIST data set.