A Multiclass Boosting Classification Method With Active Learning

Open Access
- Author:
- Huang, Jian
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Master of Science
- Document Type:
- Master's Thesis
- Date of Defense:
- November 10, 2009
- Committee Members:
- C Lee Giles, Thesis Advisor/Co-Advisor
- Keywords:
- Active Learning
- Multiclass Classification
- Boosting
- Data Mining
- Abstract:
- Boosting methods, a class of ensemble learning methods, have been very popular in the machine learning and data mining fields. The classical AdaBoost algorithm requires only that the underlying weak learner perform better than random guessing, and has shown empirical resistance to overfitting. Many challenges, however, remain to be tackled. There are a number of theoretical research questions, such as how to efficiently handle the multiclass classification setting and how to reduce the impact of outliers on the ensemble model. From an empirical perspective, scaling a multiclass boosting algorithm, which requires training many underlying weak learners, up to large datasets is of practical interest. A novel multiclass classification algorithm, Gentle Adaptive Multiclass Boosting Learning (GAMBLE), is proposed to address these issues. The algorithm naturally extends the two-class Gentle AdaBoost algorithm to multiclass classification by using the multiclass exponential loss and the multiclass response encoding scheme. Unlike other multiclass algorithms that reduce the K-class classification task to K binary classifications, GAMBLE handles the task directly and symmetrically, with only one committee classifier. We formally derive the GAMBLE algorithm with the quasi-Newton method, and prove the structural equivalence of the two regression trees in each boosting step. To scale up to large datasets, we utilize the generalized Query By Committee (QBC) active learning framework to focus learning on the most informative samples. Empirical results show that with QBC-style active sample selection, faster training and potentially higher classification accuracy can be achieved using only an informative fraction of the training instances. GAMBLE's numerical superiority, structural elegance, and low computational complexity make it highly competitive with state-of-the-art multiclass classification algorithms.
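
The multiclass exponential loss and symmetric response encoding the abstract refers to are standard in multiclass boosting. As a sketch only, following the common formulation (e.g., Zhu et al.'s multi-class AdaBoost framework; the thesis's exact notation may differ in detail):

```latex
% Sketch only: the standard symmetric multiclass response encoding and
% exponential loss; the thesis's exact formulation may differ.
% A label c in {1, ..., K} is encoded as the K-vector y with
%   y_k = 1            if k = c,
%   y_k = -1/(K - 1)   otherwise,
% so the components of y sum to zero, and the committee f is
% constrained the same way:
\[
  L\bigl(\mathbf{y}, f(\mathbf{x})\bigr)
    = \exp\!\Bigl(-\tfrac{1}{K}\,\mathbf{y}^{\top} f(\mathbf{x})\Bigr),
  \qquad
  \sum_{k=1}^{K} f_k(\mathbf{x}) = 0.
\]
```

Under this encoding, a correct prediction makes the inner product y^T f large and the loss small, which is what lets a single committee classifier treat all K classes symmetrically rather than via K binary reductions.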
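The QBC-style selection step can likewise be illustrated. Below is a minimal Python sketch, not the thesis's implementation: it assumes a generic committee of fitted classifiers with scikit-learn-style `predict` methods and scores pool samples by vote entropy, one common QBC disagreement measure. The names `qbc_select`, `committee`, and `X_pool` are hypothetical.

```python
import numpy as np

def qbc_select(committee, X_pool, n_queries):
    """Rank unlabeled pool samples by committee disagreement (vote entropy)
    and return the indices of the n_queries most contentious samples.

    Illustrative sketch only; the thesis's generalized QBC framework may
    measure disagreement differently.
    """
    # Each column holds one committee member's predicted labels for the pool.
    votes = np.stack([clf.predict(X_pool) for clf in committee], axis=1)
    n_members = votes.shape[1]

    # Vote entropy per sample: high entropy means the committee disagrees,
    # so the sample is considered informative and worth querying.
    entropies = np.empty(len(X_pool))
    for i, row in enumerate(votes):
        _, counts = np.unique(row, return_counts=True)
        p = counts / n_members  # empirical vote distribution (all p > 0)
        entropies[i] = -np.sum(p * np.log(p))

    # Indices of the n_queries most ambiguous samples.
    return np.argsort(entropies)[-n_queries:]
```

In an active boosting loop, the selected samples would be labeled (or up-weighted) and the next weak learner trained only on that informative fraction, which is the source of the training-time savings the abstract reports.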