MAXIMUM ENTROPY AND IMPROVED ITERATIVE SCALING FOR CLASSIFICATION ON MIXED SPACES, ENSEMBLE CLASSIFICATION AND EXTENSIONS.

Open Access
Author:
Pal, Siddharth
Graduate Program:
Electrical Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
November 15, 2006
Committee Members:
  • David Jonathan Miller, Committee Chair
  • William Kenneth Jenkins, Committee Chair
  • Kwang Yun Lee, Committee Member
  • Hongyuan Zha, Committee Member
Keywords:
  • Distributed Ensemble Classification
  • Ensemble Classification
  • Classification on Mixed feature spaces
  • Transductive learning
  • General Inference
Abstract:
Improved iterative scaling (IIS) is an algorithm for learning maximum entropy (ME) joint and conditional probability models, consistent with specified constraints, that has found great utility in natural language processing and related applications. In most IIS work on classification, discrete-valued "feature functions" are considered, depending on the data observations and class label, with constraints measured by frequency counts taken over hard (0-1) training set instances. In this thesis, we consider the case where the training (and test) sets consist of instances of probability mass functions on the features, rather than hard feature values. IIS extends in a natural way to this case. This has applications 1) to ME classification on mixed discrete-continuous feature spaces and 2) to ME aggregation of soft classifier decisions in ensemble classification. Moreover, we combine these methods, yielding a method with proven learning convergence that jointly performs (soft) decision-level and feature-level fusion in making ensemble decisions.

In this thesis we also consider ensemble classification for the case where there is no common labeled training data for jointly designing the individual classifiers and the function that aggregates their decisions. This problem, which we call distributed ensemble classification, applies when individual classifiers operate (perhaps remotely) on different sensing modalities and when combining proprietary or legacy classifiers. The conventional wisdom in this case is to apply fixed rules of combination, such as voting methods or rules for aggregating probabilities. Alternatively, we take a transductive approach, optimizing the combining rule for an objective function measured on the unlabeled batch of test data.
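The move from hard to soft instances can be illustrated with a toy sketch (all names and data here are hypothetical, not from the thesis): with indicator feature functions, the empirical constraint under soft instances is the expected feature count under each instance's pmf, and it reduces to the ordinary frequency count when every pmf is one-hot.

```python
import numpy as np

# Hypothetical toy setup: N instances, each giving a pmf over K discrete
# values of a single feature, plus a class label in {0, 1}.
rng = np.random.default_rng(0)
N, K = 5, 3
pmfs = rng.dirichlet(np.ones(K), size=N)   # soft instances: each row sums to 1
labels = np.array([0, 1, 0, 1, 1])

# For indicator features f_{k,c}(x, y) = 1[x = k and y = c], hard training
# data yields 0-1 frequency counts; with soft instances the constraint is
# instead the *expected* count under each instance's pmf.
def expected_counts(pmfs, labels, n_classes=2):
    counts = np.zeros((pmfs.shape[1], n_classes))
    for p, y in zip(pmfs, labels):
        counts[:, y] += p                  # expectation replaces the 0-1 count
    return counts / len(labels)

soft = expected_counts(pmfs, labels)

# With one-hot pmfs this reduces exactly to ordinary frequency counting:
hard_pmfs = np.eye(K)[pmfs.argmax(axis=1)]
hard = expected_counts(hard_pmfs, labels)
```

In both cases the resulting constraint table sums to one, so the same IIS-style update machinery applies; only the empirical expectations being matched change.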
We propose maximum likelihood (ML) objectives that are shown to yield well-known forms of probabilistic aggregation, albeit with an iterative, Expectation-Maximization (EM) based adjustment to account for mismatch between the class priors used by individual classifiers and those reflected in the new data batch. We also propose an information-theoretic method which generally outperforms the ML methods, better handles classifier redundancies, and addresses some scenarios where the ML methods are not applicable. This method also handles the case of classes missing from the test batch. On UC Irvine benchmark data, all our methods improve classification accuracy over fixed rules when there is prior mismatch.
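The general flavor of EM-based prior adjustment on an unlabeled batch can be sketched as follows. This is a minimal sketch of the standard iterate-between-reweighting-and-re-averaging scheme for correcting class-prior shift, not the thesis's specific ML objectives; the function name, toy data, and iteration count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of EM re-estimation of class priors on an unlabeled test
# batch. `post` holds a classifier's posteriors P(y|x) computed under its
# original training priors `train_prior`; the batch may have different priors.
def adjust_priors(post, train_prior, n_iter=50):
    prior = train_prior.copy()
    for _ in range(n_iter):
        # E-step: reweight each posterior by the ratio of current estimated
        # priors to the classifier's training priors, then renormalize.
        w = post * (prior / train_prior)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new prior estimate = average responsibility over the batch.
        prior = w.mean(axis=0)
    return prior, w

# Toy batch: 2-class posteriors from a classifier trained with priors
# (0.5, 0.5), applied to a batch where class 0 actually dominates.
post = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
train_prior = np.array([0.5, 0.5])
new_prior, corrected = adjust_priors(post, train_prior)
```

Here `new_prior` shifts toward the class proportions implied by the batch, and `corrected` gives the prior-adjusted posteriors used for the final decisions.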