AN ANALYTICAL FRAMEWORK FOR THEORETICAL ANALYSES IN CLASSIFIER ENSEMBLES AND A STUDY OF ISSUES IN CLUSTER VALIDATION FOR GENOMIC DATA
Open Access
- Author:
- Narasimhamurthy, Anand
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- April 15, 2006
- Committee Members:
- Raj Acharya, Committee Chair/Co-Chair
Piotr Berman, Committee Member
Dr Jia Li, Committee Member
Dr Rajeev Sharma, Committee Member
Prof Rangachar Kasturi, Committee Member
- Keywords:
- classifier combination
classifier ensembles
majority voting
diversity in classifier ensembles
cluster validation
microarrays
network flow
- Abstract:
- Classification and clustering subsume a large number of pattern recognition tasks. The contribution of this work is two-fold. The first part relates to classification, more specifically to classifier ensembles (multiple classifier systems) for binary (two-class) classification problems. In the second part of this work, we explore some of the issues in cluster validation as they relate to genomic data. Classifier ensembles have proved to be promising and useful in various applications. The basic idea is to build a team of classifiers and combine their outputs in order to obtain a more 'robust' classification, as opposed to relying on the output of a single classifier. The outputs of the classifiers can be combined in a number of ways. Majority voting is a simple yet useful combination scheme. Our contribution in the area of multiple classifier systems includes the formulation of the problem of computing the upper and lower bounds of majority voting accuracy for an ensemble of binary classifiers as a linear program (LP). The resulting analytical framework can be used to perform a variety of analyses related to voting. Diversity and complementarity are considered desirable properties in an ensemble of classifiers; however, there is no widely accepted characterization of these concepts, which makes an objective evaluation difficult. Many of the measures defined in the literature are formulated in terms of correct/incorrect classifications; these are referred to as error-diversity measures. We show that the analytical framework mentioned above can be used effectively to evaluate error-diversity measures and to explore whether there is a useful relationship between the selected diversity measures and the ensemble accuracy. Next, we explore some of the issues in cluster validation in the context of microarray data. Clustering is often an important first step in the analysis of genomic data, and cluster validation is in turn an important step in cluster analysis.
We assess the suitability of standard cluster validation techniques for microarray data. An important goal in clustering genomic data is often to group genes according to underlying biologically relevant criteria such as function. It is often of interest to compare a clustering result with an external clustering, for instance comparing the grouping of genes obtained by applying a clustering algorithm to microarray data against a reference grouping, i.e., a "gold standard". We propose a measure for the distance between two membership matrices and suggest when this could be a suitable choice for the above purpose. We use standard network flow algorithms to compute the measure. We also consider related theoretical problems and show how they too can be formulated as network flow problems.
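
The linear-programming view of majority voting mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which constraints the dissertation's LP uses, so the example below assumes that only the individual accuracy of each classifier is known. The variables are the probabilities of each correct/incorrect pattern across the ensemble, and the objective is the probability that more than half of the classifiers are correct. The function names `solve_square` and `majority_vote_bounds` are hypothetical, and the LP is solved by brute-force enumeration of basic feasible solutions, which is only practical for very small ensembles.

```python
from fractions import Fraction
from itertools import combinations, product


def solve_square(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination; None if singular."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def majority_vote_bounds(accuracies):
    """Return (lower, upper) bounds on majority-vote accuracy.

    Assumption (not taken from the abstract): the only constraints are
    the individual accuracies of the classifiers.
    """
    n = len(accuracies)
    # Each pattern records which classifiers are correct (1) or wrong (0).
    patterns = list(product((0, 1), repeat=n))
    # Equality constraints: probabilities sum to 1, and the marginal
    # probability that classifier i is correct equals its accuracy.
    rows = [[Fraction(1)] * len(patterns)]
    rows += [[Fraction(p[i]) for p in patterns] for i in range(n)]
    rhs = [Fraction(1)] + [Fraction(a) for a in accuracies]
    # Objective coefficient is 1 where more than half the votes are correct.
    cost = [Fraction(int(2 * sum(p) > n)) for p in patterns]
    values = []
    # An LP over a bounded feasible region attains its optimum at a vertex,
    # so it suffices to check every basic feasible solution.
    for basis in combinations(range(len(patterns)), len(rows)):
        sub = [[row[j] for j in basis] for row in rows]
        x = solve_square(sub, rhs)
        if x is not None and all(v >= 0 for v in x):
            values.append(sum(cost[j] * v for j, v in zip(basis, x)))
    return min(values), max(values)


# Three classifiers, each 60% accurate.
lower, upper = majority_vote_bounds([Fraction(3, 5)] * 3)
print(lower, upper)  # 2/5 9/10
```

Under these assumptions, three classifiers that are each 60% accurate can yield a majority-vote accuracy anywhere between 40% and 90%, depending on how their errors coincide. For larger ensembles a dedicated LP solver would be the more sensible choice, since the number of patterns grows as 2^n.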