Trustworthy Unsupervised and Supervised Learning
Restricted (Penn State Only)
- Author:
- Zhang, Lixiang
- Graduate Program:
- Statistics
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- May 26, 2022
- Committee Members:
- Lin Lin, Major Field Member & Dissertation Advisor
- Jia Li, Chair & Dissertation Advisor
- Matthew Reimherr, Major Field Member
- Emily Davenport, Outside Unit & Field Member
- Ephraim Hanks, Professor in Charge/Director of Graduate Studies
- Keywords:
- Trustworthy Machine Learning
- Clustering Uncertainty Assessment
- Multi-view Stability-based Clustering
- Enhancing Interpretability of DNN
- Robust Surrogate Model for Uncertainty Quantification
- Abstract:
- Decision-making in many fields that profoundly affect people's lives, such as healthcare, economics, and social science, is increasingly data-driven. Many supervised and unsupervised machine learning tools have been developed to achieve strong performance in these fields. Nevertheless, more and more researchers have begun to question the trustworthiness of these tools. Many problematic phenomena arise and cannot be properly addressed when the tools are used merely as black boxes, especially in high-stakes decision-making. Beyond good average performance, trustworthiness is now attracting growing attention. Its focus differs across areas, but it generally comprises desirable attributes in the face of system errors or human disturbances, such as safety, stability, interpretability, fairness, and privacy. We approach this topic mainly from two aspects: stability and interpretability. This dissertation consists of four projects. The first and second projects concern trustworthy unsupervised learning. In the first project, we develop a toolkit called Covering Point Set (CPS) analysis to quantify clustering uncertainty at the levels of individual clusters and overall partitions; it can be easily integrated into existing cluster-analysis pipelines for biomedical data. Applied to three usage scenarios with biomedical data, CPS analysis proves more effective at evaluating cluster uncertainty than existing methods. In the second project, we develop a novel method for multi-view data clustering, namely covering point set merge (CPS-merge) analysis, which requires neither pooling data nor concatenating variables across the views. The main idea is to maximize clustering stability by merging clusters formed by the Cartesian product of clustering labels acquired in the individual views. Our method also quantifies the contribution of each view to the formation of any cluster.
The method can be readily applied and incorporated into existing clustering pipelines because the algorithm adopted for any view is unrestricted. This flexibility, lacking in many multi-view clustering methods, enables us to leverage advanced single-view clustering algorithms. Importantly, our method accounts for both consensus and complementary effects between views. In contrast, existing ensemble algorithms seek a consensus among clustering results obtained in different views, implicitly assuming that these results are variations of one clustering structure. We demonstrate the advantages of the new approach through experiments on single-cell datasets and comparisons with several state-of-the-art methods. The third and fourth projects concern trustworthy supervised learning. In the third project, we propose a neural network called Variable-block tree Net (VtNet) with demonstrated advantages over the commonly used multi-layer perceptron (MLP) and deep belief network (DBN) for biomedical data classification. VtNet exploits the causal relationships among variables: the neural network architecture depends on the directed causal graph of the variables. This data-dependent architecture reduces model complexity compared with MLP while achieving high accuracy. Moreover, VtNet makes it easy to quantify the importance of variables. As with Random Forest (RF), this significance score is a byproduct of the classification model. Hypothesis tests show that variables with higher significance scores influence classification more strongly. Experiments demonstrate that VtNet not only achieves state-of-the-art accuracy but also provides useful insights into the roles of variables. In the fourth project, we propose a novel application of Deep Neural Network (DNN) adversarial learning methods to surrogate models of simulators.
In science and engineering, many highly complex computational models, namely simulators, are used to simulate physical or biological processes. DNNs have gained popularity as surrogate models for their state-of-the-art emulation accuracy. However, a DNN is prone to error when its input is perturbed in particular ways, namely adversarial attacks. This issue has been largely ignored by researchers using emulation models. We show its severity, in terms of both emulation accuracy and uncertainty quantification, through empirical studies and hypothesis testing. Furthermore, we propose a computationally efficient adversarial training method and demonstrate that a DNN surrogate model trained with this method achieves greatly improved robustness without compromising emulation accuracy.
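The Cartesian-product step that underlies CPS-merge can be illustrated with a minimal sketch. The function name and toy labels below are hypothetical; the full CPS-merge analysis would further merge these product clusters to maximize a bootstrap-based stability criterion, which is not shown here.

```python
from collections import defaultdict

def product_clusters(labels_a, labels_b):
    """Group samples by their pair of labels (view A, view B).

    This forms the Cartesian-product clusters from which CPS-merge
    starts; the subsequent stability-driven merging is omitted.
    """
    groups = defaultdict(list)
    for i, pair in enumerate(zip(labels_a, labels_b)):
        groups[pair].append(i)
    return dict(groups)

# Toy example: 6 samples clustered independently in two views.
view_a = [0, 0, 1, 1, 1, 0]
view_b = [0, 1, 1, 1, 0, 0]
print(product_clusters(view_a, view_b))
# {(0, 0): [0, 5], (0, 1): [1], (1, 1): [2, 3], (1, 0): [4]}
```

Because the per-view labels can come from any clustering algorithm, this step is agnostic to how each view was clustered, which is what allows CPS-merge to plug into existing single-view pipelines.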