Statistical Learning of Complex Large-Scale Dynamic Systems

Restricted (Penn State Only)
Author:
Lee, Kevin Haeseung
Graduate Program:
Statistics
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
April 18, 2017
Committee Members:
  • Lingzhou Xue, Dissertation Advisor, Committee Chair
  • David R. Hunter, Committee Member
  • Runze Li, Committee Member
  • Nanyin Zhang, Outside Member
  • Mark S. Handcock, Special Member
Keywords:
  • Statistical learning
  • Model-based clustering
  • Dynamic networks
  • Graphical model
  • EM algorithm
  • Variational inference
Abstract:
Due to advances in data collection technologies, large-scale network/graph analysis has become increasingly important in research fields such as artificial intelligence, business, finance, genomics, physics, and sociology. Moreover, recent large-scale network and high-dimensional data share common properties that pose new challenges for existing statistical methods: i) the data come from different sources and exhibit heterogeneous relations or dependencies; ii) the hidden structures may change over time, as relations and dependencies are rarely static; and iii) the data are often collected in a large-scale, dynamic fashion. Hence, this dissertation focuses on modeling and learning large-scale dynamic networks and on exploring the heterogeneous dependencies of high-dimensional data. Dynamic network modeling is an emerging statistical technique for a wide range of real-world applications, and detecting community structure in dynamic networks is a fundamental research question. However, owing to significant computational challenges and difficulties in modeling communities, little progress has been made in the current literature toward effectively finding communities in dynamic networks. In this dissertation, we introduce a novel model-based clustering framework for dynamic networks that builds on (semiparametric) exponential-family random graph models and inherits the philosophy of finite mixture modeling. To determine an appropriate number of communities, we propose a composite conditional likelihood Bayesian information criterion. Moreover, we design an efficient variational expectation-maximization (EM) algorithm to compute approximate maximum likelihood estimates of the network parameters and mixing proportions. By combining variational methods with minorization-maximization techniques, our methods scale well to large dynamic networks. Finally, the power of our method is demonstrated through simulation studies and real-world applications. 
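The dissertation's semiparametric ERGM mixture framework is not reproduced here, but the core mean-field variational EM idea for model-based community detection can be sketched on a simpler special case, a Bernoulli stochastic block model. All function names below are illustrative, and the update equations are the standard mean-field fixed-point and moment updates for an SBM, not the dissertation's exact estimator:

```python
import numpy as np

def simulate_sbm(n_per_block, K, p_in, p_out, rng):
    """Simulate a symmetric adjacency matrix from a planted-partition SBM."""
    n = n_per_block * K
    z = np.repeat(np.arange(K), n_per_block)              # planted community labels
    P = np.where(z[:, None] == z[None, :], p_in, p_out)   # edge probabilities
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    return A + A.T, z                                     # symmetric, no self-loops

def variational_em_sbm(A, K, n_iter=50, rng=None):
    """Mean-field variational EM for a Bernoulli stochastic block model."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[0]
    tau = rng.dirichlet(np.ones(K), size=n)               # variational memberships
    for _ in range(n_iter):
        # M-step: mixing proportions and block edge probabilities
        pi = tau.mean(axis=0)
        num = tau.T @ A @ tau
        den = np.outer(tau.sum(0), tau.sum(0)) - tau.T @ tau
        B = np.clip(num / np.maximum(den, 1e-10), 1e-6, 1 - 1e-6)
        # E-step: mean-field fixed-point update of the memberships
        logp = (np.log(pi)[None, :]
                + A @ tau @ np.log(B).T
                + (1 - A - np.eye(n)) @ tau @ np.log(1 - B).T)
        logp -= logp.max(axis=1, keepdims=True)           # numerical stability
        tau = np.exp(logp)
        tau /= tau.sum(axis=1, keepdims=True)
    return tau, pi, B

rng = np.random.default_rng(0)
A, z = simulate_sbm(20, 2, p_in=0.5, p_out=0.05, rng=rng)
tau, pi, B = variational_em_sbm(A, K=2, rng=rng)
labels = tau.argmax(axis=1)                               # recovered communities
```

The variational memberships `tau` play the role of the mixture responsibilities; choosing `K` by an information criterion, as with the composite conditional likelihood BIC above, would wrap this routine in a loop over candidate values of `K`.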
Graphical models have been widely used to investigate the complex dependence structure of high-dimensional data, and it is common to assume that the observed data follow a homogeneous graphical model. In real-world applications, however, observations usually come from different sources and share heterogeneous hidden commonality. In this dissertation, we introduce a novel regularized estimation scheme for learning a nonparametric mixture of Gaussian graphical models, which explores the heterogeneous dependencies of high-dimensional data. We propose a unified penalized likelihood approach that effectively estimates both the nonparametric functional parameters and the heterogeneous graphical parameters. We also present a generalized effective EM algorithm that addresses both non-convex optimization in high dimensions and the label-switching issue. Moreover, we prove that both the ascent property and local convergence hold for the proposed algorithm with probability tending to one, and we establish the asymptotic properties of the local solution for our model under standard regularity conditions. Using our method, we discover two heterogeneous dependency structures in ADHD brain functional connectivity, and each subpopulation supports its corresponding scientific findings.
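As a minimal sketch of the penalized EM idea for a mixture of Gaussian graphical models, the toy routine below fits a two-component Gaussian mixture and regularizes each component covariance in the M-step. A ridge penalty is used purely as a simple stand-in; the dissertation's method instead applies a sparsity-inducing penalty (graphical-lasso style) to the precision matrices and handles nonparametric mixing weights. All names are illustrative:

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log-density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(Sigma)
    diff = X - mu
    maha = np.einsum('ij,ji->i', diff, np.linalg.solve(Sigma, diff.T))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def penalized_em(X, K, lam=0.1, n_iter=100, rng=None):
    """EM for a K-component Gaussian mixture with a ridge-regularized
    covariance M-step (a crude stand-in for a sparsity penalty)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)  # random-point init
    Sigma = np.stack([np.cov(X.T) + lam * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior responsibilities of each component
        logr = np.column_stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                                for k in range(K)])
        logr -= logr.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted moments with a penalty on each covariance
        nk = r.sum(axis=0)
        pi = nk / n
        for k in range(K):
            mu[k] = r[:, k] @ X / nk[k]
            diff = X - mu[k]
            S = (r[:, k, None] * diff).T @ diff / nk[k]
            Sigma[k] = S + lam * np.eye(d)  # a graphical-lasso step would go here
    return pi, mu, Sigma, r

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 3)),             # subpopulation 1
               rng.normal(4, 1, size=(100, 3))])            # subpopulation 2
pi, mu, Sigma, r = penalized_em(X, K=2, rng=rng)
```

In the full method, the per-component precision matrices would be sparse, and their zero patterns would define the heterogeneous conditional-dependence graphs of the two subpopulations, as in the ADHD connectivity application above.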