Topics on High-Dimensional Statistical Methods for Compositional Data

Restricted (Penn State Only)
- Author:
- Srinivasan, Arun
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 18, 2022
- Committee Members:
- Lingzhou Xue, Chair & Dissertation Advisor
Xiang Zhan, Outside Unit & Field Member
Runze Li, Major Field Member
Francesca Chiaromonte, Major Field Member
Ephraim Hanks, Professor in Charge/Director of Graduate Studies
- Keywords:
- compositional data analysis
robust statistics
human microbiome
high-dimensional statistics
covariance estimation
knockoff filter
variable selection
Tyler's M-estimator
Huber's M-estimator
- Abstract:
- The compositional data structure commonly arises when normalization into a set of proportions is necessary for analysis. For example, count-based data often must be normalized by the total count across all features in each sample. Compositional data arise in many practical domains, including finance, geology, and biology. However, practical data are often riddled with intricacies that complicate statistical analysis. Modern data are inherently high-dimensional, with the number of features of interest p dwarfing the number of available samples n. Further, new statistical methodology for compositional data must account for the sum constraint, heavy tails, and outliers. Ignoring these issues can lead to spurious or misleading results that may jeopardize reproducibility. Thus, the field of high-dimensional compositional data analysis (CoDA) has grown rapidly in recent years to handle this complex data structure. In this dissertation, we propose novel methodology for high-dimensional compositional data in two key areas: (1) multivariate regression with finite-sample false discovery rate control and (2) robust dependence structure estimation. We focus in particular on human microbiome analysis. Both areas aid in better understanding the complex relationships between microbes within the human body and the effect that microbes have on public health. Finally, we also analyze supermarket sales data to demonstrate the flexibility of compositional data methods across multiple domains.
The first project focuses on fine-mapping of the human microbiome while controlling the number of false positives. We present the two-step compositional knockoff filter (CKF), which allows us to explore the association between a scalar response of interest and a large number of microbial taxa with finite-sample false discovery rate control. The second project extends the knockoff technique of the first project to regression settings with multiple outcomes of interest, leveraging information across multiple, potentially correlated responses to enhance discovery power. The third project develops a joint, robust estimation algorithm for the shape matrix, a general measure of statistical dependency, for compositional data. Finally, in the fourth project, we build on this procedure to develop a robust, composition-adjusted thresholding covariance procedure based on Huber-type M-estimation to estimate the sparse covariance structure of high-dimensional compositional data.
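The normalization and sum constraint mentioned at the start of the abstract can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the `closure` helper, the simulated count matrix, and all variable names are assumptions made for illustration. It normalizes each sample's counts by that sample's total and shows one consequence of the sum constraint: every row of the sample covariance matrix of the proportions sums to zero, so the proportions cannot all be positively or independently related even when the underlying counts are independent.

```python
import numpy as np


def closure(counts):
    """Normalize each sample (row) of a count matrix to proportions.

    Hypothetical helper for illustration; not the dissertation's code.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals <= 0):
        raise ValueError("every sample needs a positive total count")
    return counts / totals


# Simulated data (an assumption): 50 samples, 4 features, independent counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 100, size=(50, 4))

# Each row of `props` is a composition: non-negative entries summing to one.
props = closure(counts)

# Sample covariance between features of the normalized data.
cov = np.cov(props, rowvar=False)

# Sum constraint: each sample's proportions add up to one ...
print(props.sum(axis=1)[:3])
# ... which forces every row of the covariance matrix to sum to zero
# (up to floating-point error).
print(np.abs(cov.sum(axis=1)).max())
```

Because each diagonal entry of the covariance matrix is positive, the zero row sums imply that every feature has at least one negative covariance with another feature. This is an artifact of the normalization rather than of the underlying counts, and it is one reason methods designed for unconstrained data can give misleading results on compositions.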