Predictive Modeling for High Dimensional Longitudinal Data

Restricted (Penn State Only)
- Author:
- Liang, Junjie
- Graduate Program:
- Informatics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- March 17, 2022
- Committee Members:
- Xiang Zhang, Major Field Member
Lin Lin, Outside Unit & Field Member
Suhang Wang, Major Field Member
Vasant Honavar, Chair & Dissertation Advisor
Mary Beth Rosson, Program Head/Chair
- Keywords:
- Longitudinal data analysis
longitudinal correlation
cluster correlation
Gaussian Process
Factorization Machines
- Abstract:
- Longitudinal studies, which involve repeated observations of a set of individuals taken at irregularly spaced time points, arise in many applications. Predictive models for longitudinal data generally need to account for correlation in the data, i.e., correlation among repeated observations of the same individual and/or correlation among groups of individuals. Ignoring either kind of correlation can lead to misleading statistical inferences. Choosing a correlation structure that reflects the correlations present in the data can be non-trivial. Moreover, the relationships between the variables and the outcomes of interest can be highly complex and non-linear. Furthermore, modern applications often call for longitudinal methods that scale gracefully to an increasing number of variables and millions of data points. This dissertation aims to address these challenges in longitudinal data analysis using machine learning and representation learning approaches. Specifically, our work is dedicated to redesigning state-of-the-art longitudinal models to suit large-scale, high-dimensional longitudinal settings. We focus on improving mixed effects models and non-parametric models by answering the following research questions: (i) How can we design mixed effects models to handle longitudinal data with thousands of variables and automate the selection between fixed and random effects? (ii) How can we design non-parametric models to handle longitudinal data with time-varying and time-invariant effects and automate the discovery of complex correlations? (iii) How can we design non-parametric models to handle longitudinal data with outcomes that may exhibit state transitions, abrupt discontinuities, and complex correlations? Against this background, this dissertation investigates two lines of approaches: Factorization Machines and Gaussian Processes.
We tackle both the theoretical and practical challenges in adapting these approaches to longitudinal settings. For each proposed model, we explore provably efficient algorithms to improve its applicability to high-dimensional data.