Approaches to reduce and integrate data in structured and high-dimensional regression problems in Genomics

Open Access
Liu, Yang
Graduate Program:
Doctor of Philosophy
Document Type:
Date of Defense:
August 04, 2015
Committee Members:
  • Francesca Chiaromonte, Dissertation Advisor
  • Francesca Chiaromonte, Committee Chair
  • Bing Li, Committee Member
  • Runze Li, Committee Member
  • Yu Zhang, Committee Member
  • Mary Poss, Committee Member
Keywords:
  • Data integration
  • Genomics
  • Ordinary least squares
  • Structured data
  • Sufficient dimension reduction
  • Variable selection
Abstract:
Analysis of high-dimensional data has become increasingly important in several fields of the sciences and engineering. This is particularly true for Genomics with its expanding repertoire of high-throughput technologies. For many regression-like analyses, dimension reduction in the predictor space can be very effective. The most commonly used approaches assume that predictors and samples are similar in nature and can simultaneously participate in the reduction. However, recent high-throughput genomic data is often heterogeneous and structured; for instance, both samples and predictors may be labeled based on their origin and/or information available on their nature or function. Exploiting known structure in samples and predictors when performing dimension reduction can be an avenue for integrating data collected through multiple studies and diverse high-throughput platforms. To address this challenge, we propose a novel Sufficient Dimension Reduction (SDR) approach that we call structured SDR, and one methodology to pursue it, structured Ordinary Least Squares (sOLS), that is effective and parsimonious even when, as is the case in many Genomics applications, the number of available samples is relatively small compared to the number of predictors. sOLS combines ideas from existing SDR literature to merge reductions performed within groups of samples and/or predictors. Importantly, it utilizes a novel version of OLS for grouped predictors that requires far less computation than other recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. In addition, we extend our approach and methodology to tackle regressions with binary or multivariate responses, as well as regressions with correlated observations. These extensions expand the application scope of structured SDR, e.g. to classification problems and the analysis of spatial data. They, too, are demonstrated through simulations and applications in Genomics and Health Care. This dissertation holds the promise of providing the Genomics community with an effective data reduction and integration approach, and may also have broad applicability to complex data from other scientific fields.