Interpretable Statistical Learning: From Hidden Markov Models to Neural Networks

Open Access
- Author:
- Seo, Beomseok
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 24, 2021
- Committee Members:
- Ephraim Hanks, Major Field Member
Lin Lin, Co-Chair of Committee
Jia Li, Co-Chair & Dissertation Advisor
John Yen, Outside Unit & Field Member
Ephraim Mont Hanks, Program Head/Chair
- Keywords:
- Interpretable Machine Learning
Nonlinear Regression and Classification
Unsupervised Variable Selection
Clustering Stability
Interpretable Neural Networks
- Abstract:
- Interpretability of machine learning models is important in critical applications, where it helps gain the trust of users. Despite their strong performance, black-box machine learning models often meet resistance in practice, especially in areas such as economics, social science, healthcare, and administrative decision making. This dissertation explores methods to improve the human interpretability of both supervised and unsupervised machine learning. I approach this topic by building statistical models of relatively low complexity and by developing post-hoc, model-agnostic tools. The dissertation consists of three projects.

In the first project, we propose a new method to estimate a mixture of linear models (MLM) for regression or classification that is relatively easy to interpret. We use a deep neural network (DNN) as a proxy for the optimal prediction function so that the MLM can be estimated effectively. We also propose visualization methods and quantitative approaches to interpret the MLM predictor. Experiments show that the new method allows us to trade off interpretability against accuracy: an MLM estimated under the guidance of a trained DNN fills the gap between a highly explainable linear statistical model and a highly accurate but difficult-to-interpret predictor.

In the second project, we develop a new block-wise variable selection method for clustering that exploits the latent states of a hidden Markov model on variable blocks or a Gaussian mixture model. Specifically, the variable blocks are formed by depth-first search on a dendrogram built from the pairwise mutual information between variables. We demonstrate that the latent states of the variable blocks, together with the mixture model parameters, represent the original data effectively and much more compactly. We therefore cluster the data using the latent states and select variables according to the relationship between the states and the clusters. Because true class labels are unknown in the unsupervised setting, we first generate more refined clusters, namely semi-clusters, for variable selection, and then determine the final clusters from the dimension-reduced data. The new method increases the interpretability of high-dimensional clustering by reducing model complexity and selecting variables, while retaining clustering accuracy comparable to other widely used methods.

In the third project, we propose a new framework to interpret and validate the clustering results of any baseline method. We exploit optimal transport alignment and bootstrapping to quantify the variation of clustering results at the level of both the overall partition and individual clusters. Set relationships between clusters, such as one-to-one match, split, and merge, can be revealed. We also propose a covering point set for each cluster, a concept akin to the confidence interval. These tools help users understand the behavior of the baseline clustering method. Experimental results on both simulated and real datasets are provided, and the corresponding R package, OTclust, is available on CRAN.
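To make the third project's idea concrete, the following is a minimal Python sketch of bootstrap-based clustering stability. It is a simplified stand-in for the dissertation's method: the optimal transport alignment is replaced here by a one-to-one Hungarian matching of cluster labels, and k-means stands in for an arbitrary baseline clustering method; all function and variable names are illustrative, not from the OTclust package.

```python
# Hypothetical sketch: quantify partition-level clustering stability by
# reclustering bootstrap resamples and aligning labels to a reference
# partition. One-to-one label matching approximates the OT alignment.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian clusters in 2-D.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

# Reference partition from the baseline clustering method (k-means here).
ref = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def align(labels, ref_labels, k):
    """Relabel `labels` to best agree with `ref_labels` via Hungarian matching."""
    cost = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Negative overlap: minimizing cost maximizes label agreement.
            cost[i, j] = -np.sum((labels == i) & (ref_labels == j))
    rows, cols = linear_sum_assignment(cost)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[l] for l in labels])

# Bootstrap: recluster each resample, align to the reference partition,
# and record the fraction of resampled points whose labels agree.
agreements = []
for b in range(20):
    idx = rng.choice(len(X), len(X), replace=True)
    lab = KMeans(n_clusters=2, n_init=10, random_state=b).fit_predict(X[idx])
    lab = align(lab, ref[idx], k=2)
    agreements.append(np.mean(lab == ref[idx]))

stability = float(np.mean(agreements))
print(f"partition stability: {stability:.3f}")
```

On well-separated data like this, the stability score is close to 1; unstable or ambiguous clusterings drive it down. The full framework goes further, using optimal transport to reveal split and merge relationships between clusters and to build a covering point set per cluster.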