Statistical Learning on High-Dimensional Multi-Source Data
Restricted (Penn State Only)
- Author:
- Liu, Shuo Shuo
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 05, 2023
- Committee Members:
- Bing Li, Professor in Charge/Director of Graduate Studies
- Xiang Zhu, Major Field Member
- Lan Kong, Outside Unit & Field Member
- Matthew Reimherr, Major Field Member
- Runze Li, Chair & Dissertation Advisor
- Keywords:
- Transfer Learning
- Multi-View Clustering
- High-Dimensional Data
- Machine Learning
- Abstract:
- Multi-source data refers to data collected from multiple sources or modalities. With the increasing availability of digital data, multi-source data has been applied in a wide range of applications, including sentiment analysis, object recognition, and recommendation systems. In sentiment analysis, for example, text, images, and audio are combined to improve the accuracy of sentiment classification; in recommendation systems, data from different sources are combined to improve the quality of recommendations. Integrating these diverse and heterogeneous sources can provide a more comprehensive understanding of a particular phenomenon or problem. However, learning from multi-source data raises many challenges, such as handling heterogeneous data and developing interpretable models. To address these challenges, a number of statistical and machine learning methods have been developed, including multi-view learning and transfer learning. Multi-view learning analyzes data from multiple sources or views to learn a common representation of the data. Transfer learning enables the transfer of knowledge from one domain or task to another. This dissertation develops new methods for multi-view learning and transfer learning.

Chapter 3 presents a weighted multi-view NMF algorithm, termed WM-NMF, for integrative clustering of multi-view heterogeneous or corrupted data. We improve on existing multi-view NMF algorithms and propose to perform multi-view clustering by quantifying each view's information content through learning both view-specific and reconstruction weights. The proposed algorithm enlarges the positive effects of important views and alleviates the adverse effects of unnecessary views. We further demonstrate the competitive clustering performance of WM-NMF: using several datasets, we show that our algorithm significantly outperforms existing multi-view algorithms in terms of six evaluation metrics.

In Chapters 4 and 5, we propose a novel, interpretable, one-step, and unified framework for transfer learning. We first apply it to high-dimensional linear regression in Chapter 4 and extend it to generalized linear models in Chapter 5. More specifically, we propose a unified transfer learning model by re-defining the design matrix and the response vector in the context of high-dimensional statistical models; to the best of our knowledge, this is the first work on unified transfer learning. The theoretical results show that it attains tighter upper bounds on the estimation errors than the Lasso using the target data only, provided the target and source data are sufficiently close. We also prove that our bound improves on existing methods, with a tighter minimax rate and a wider range of admissible values for the transferring level. Detecting the transferable data, including the transferable source data and the transferable variables, is a major task in transfer learning. Our unified model automatically identifies the transferable variables by construction. We develop a hypothesis testing method and a data-driven method for source detection in Chapter 4 and Chapter 5, respectively. To the best of our knowledge, this is the first work to identify the transferable variables through the model's construction and the first to incorporate statistical inference in transfer learning.
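The WM-NMF algorithm described in Chapter 3 learns view-specific and reconstruction weights jointly within the factorization; the details are in the dissertation. As a much simplified, illustrative sketch of the general idea (down-weighting views that reconstruct poorly before forming a consensus clustering), the Python snippet below factorizes each view separately with scikit-learn and sets view weights inversely proportional to reconstruction error. The function name and the one-pass weighting scheme are assumptions made for illustration, not the dissertation's algorithm.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def weighted_multiview_clustering(views, n_clusters, random_state=0):
    """Toy weighted multi-view NMF-style clustering (illustrative only).

    views : list of non-negative arrays of shape (n_samples, n_features_v).
    Each view is factorized separately; views that reconstruct poorly
    (e.g., noisy or corrupted views) receive smaller weights when the
    per-view sample factors are averaged into a consensus representation.
    """
    factors, errors = [], []
    for X in views:
        model = NMF(n_components=n_clusters, init="nndsvda",
                    max_iter=500, random_state=random_state)
        W = model.fit_transform(X)                      # (n_samples, n_clusters)
        factors.append(W / (W.sum(axis=1, keepdims=True) + 1e-12))
        errors.append(model.reconstruction_err_)
    # View weights: inverse reconstruction error, normalized to sum to one.
    inv_err = 1.0 / (np.asarray(errors) + 1e-12)
    weights = inv_err / inv_err.sum()
    consensus = sum(w * F for w, F in zip(weights, factors))
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(consensus)
    return labels, weights
```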
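Chapters 4 and 5 re-define the design matrix and response vector so that transfer learning reduces to a single penalized regression. The exact construction is given in the dissertation; as a hedged sketch of one standard way to set up such a one-step formulation for high-dimensional linear regression, the snippet below stacks source and target data into an augmented design and fits a single Lasso over a shared coefficient w and a sparse target-specific contrast delta. The function name, the shared-plus-contrast parameterization, and the zero-delta screen for transferable variables are illustrative assumptions, not the dissertation's exact model.

```python
import numpy as np
from sklearn.linear_model import Lasso

def augmented_transfer_lasso(X_src, y_src, X_tgt, y_tgt, alpha=0.05):
    """Illustrative one-step transfer learning via an augmented design.

    Writes the target coefficients as beta_target = w + delta, where w is
    shared with the source and delta is a sparse target-specific contrast.
    Stacking both samples gives a single Lasso problem:
        [y_src]   [X_src    0   ] [w    ]
        [y_tgt] ~ [X_tgt  X_tgt ] [delta]
    Variables whose estimated delta is (near) zero act as "transferable":
    their effect is shared between source and target.
    """
    n_src, p = X_src.shape
    X_aug = np.block([
        [X_src, np.zeros((n_src, p))],
        [X_tgt, X_tgt],
    ])
    y_aug = np.concatenate([y_src, y_tgt])
    fit = Lasso(alpha=alpha, max_iter=50000).fit(X_aug, y_aug)
    w_hat, delta_hat = fit.coef_[:p], fit.coef_[p:]
    beta_target = w_hat + delta_hat
    transferable = np.isclose(delta_hat, 0.0)   # crude variable-level screen
    return beta_target, w_hat, delta_hat, transferable
```

In this sketch, when the source and target coefficients coincide the contrast delta is driven to zero and the fit borrows the full source sample; a large estimated delta on a variable signals that its effect differs across populations and should not be transferred.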