Mixed Integer Programming, Whitening, and Functional Data Analysis: Improving Feature Selection in “Omics” Research

Open Access
- Author:
- Kenney, Ana
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- July 08, 2021
- Committee Members:
- Kateryna Makova, Outside Unit & Field Member
Matthew Reimherr, Co-Chair & Dissertation Advisor
Francesca Chiaromonte, Chair & Dissertation Advisor
Ethan Fang, Major Field Member
Ephraim Hanks, Professor in Charge/Director of Graduate Studies
- Keywords:
- Functional Data Analysis
Mixed Integer Programming
Second Order Conic Programming
Whitening
Statistical Genomics
Robust Regression
Oracle Properties
Feature Screening
Differential Privacy
Human Microbiome
Feature Selection
- Abstract:
- Contemporary sciences are producing ever larger and more complex data sets. This requires new sets of tools for regression analysis. In particular, in supervised problems in “Omics” and biomedical applications, feature selection is critical both for prediction and for interpretation. However, multiple challenges arise due to high dimensions, interdependencies and collinearity among features as well as observations, contaminated units (outliers), weak and/or uneven regression signals, complex responses (e.g., biometric or disease phenotypes measured longitudinally), and privacy concerns in the handling of sensitive information. These challenges provide the major motivation for our work on how to utilize the L_0 norm for exact best subset selection, a feature de-correlation pre-processing step called whitening, and functional data analysis for better feature/response representation. Our proposals also showcase the benefits of integrating modern optimization tools and statistics to develop flexible and practical procedures. Our contributions are as follows: (1) We develop a pipeline for biomarker detection that exploits longitudinal outcomes through Functional Data Analysis (FDA), providing increased statistical power to under-sampled studies. (2) We provide a computationally efficient implementation of Mixed Integer Programming (MIP) for exact best subset selection and propose an extension to simultaneous outlier detection. This flexible framework is applicable to a wide variety of areas, as demonstrated through a study connecting childhood obesity to the human microbiome. (3) We utilize Second Order Conic Programming (SOCP) in a generalized form of whitening, ORTHOMAP, to de-correlate highly collinear features with minimal loss in interpretation for subsequent analysis. We demonstrate its utility in both scalar and functional regression settings through two real data applications concerning diabetes and COVID-19 mortality curves in Italy, respectively. (4) We propose privacy-preserving functional principal components, one of the most common tools in FDA, and demonstrate their applicability in longitudinal biomedical studies.