Characterizing Diseases using Genetic and Clinical variables: A Data Analytics Approach
Open Access
- Author:
- Gollapalli, Madhuri
- Graduate Program:
- Data Analytics (MS)
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- October 19, 2022
- Committee Members:
- Colin Neill, Program Head/Chair
Satish Mahadevan Srinivasan, Thesis Advisor/Co-Advisor
Youakim Badr, Committee Member
Raghu Sangwan, Committee Member
Adrian Sorin Barb, Committee Member
- Keywords:
- gene expression values
multinomial logistic regression
k-means clustering
landmark genes
principal component analysis
- Abstract:
- In the era of big data, predictive analytics plays a vital role in precision medicine, which aims at personalized patient care. Precision medicine is tailored to an individual’s genetic makeup, environment, and lifestyle, and offers more accurate medication and health care. In addition to paving the way for precision medicine, emerging data science techniques have been helpful in gaining knowledge of disease progression and prognosis, early intervention, and improved outcomes, eventually resulting in reduced hospital costs. According to the National Cancer Institute, a “genetic understanding” is the key to precision medicine in cancer treatment, whose goal is to tailor treatment to the individual while considering environmental factors. Therefore, this study takes into consideration the genetic makeup of each patient (sample) along with clinical variables. However, developing analytical models from these data is hampered by their high dimensionality. Since it is established that complex, high-dimensional data can be studied better within low-dimensional embedded spaces, this study focuses on dimensionality reduction techniques to capture the important features. An important step in these techniques is to identify the set of genetic and clinical variables that can serve as predictors for diseased tissue classification or disease type prediction. This study was expected to identify a subset of the genetic and clinical variables that can predict disease type. In summary, this study had two goals: (1) to verify whether there is a significant difference in the performance of landmark and non-landmark genes when used for clustering different types of diseased tissues, and (2) to identify a smaller subset of genetic and clinical variables that can serve as predictors for classifying the diseased tissues or disease types.
To accomplish these goals, experiments were carried out on a set of diseased tissues to understand the differences in the functionality and predictive capabilities of the genetic (landmark and non-landmark genes) and clinical variables. Using a combination of predictive analytics and statistical techniques, a statistically significant difference was observed between the landmark and non-landmark genes in their ability to cluster or classify the diseased tissue (disease) types. The landmark genes clustered the diseased tissue types slightly better than any random set of non-landmark genes, and the difference was statistically significant. It was also evident that both the clinical and the genomic variables were important for predicting the diseased tissue types. Applying feature selection techniques to the clinical variables identified Morphology, Gender, and Age of Diagnosis as the top three predictors of the diseased tissue types. Then, to identify a subset of the genetic variables (genes), the study explored using latent representations of clusters of both the landmark and non-landmark genes as predictors for a Multinomial Logistic Regression (MLR) classifier. The classification models built using MLR revealed that the principal components of the clusters of landmark genes classified the diseased tissues slightly better than the principal components of any random set of non-landmark genes, again with statistical significance, thus demonstrating that the landmark genes can serve as a subset of genetic variables and/or as a proxy for clinical variables.
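The clustering comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the data are synthetic, the names `make_toy_data`, `candidate_cols`, and `pool_cols` are hypothetical, and cluster purity is an assumed quality measure chosen for simplicity. The toy matrix is deliberately built so that the first ten columns carry a tissue-type signal, purely to exercise the code; it says nothing about real landmark genes, and a proper significance test over the random-set scores would still be needed.

```python
import random
from collections import Counter


def kmeans(rows, k, iters=50, seed=0):
    """Lloyd's k-means on equal-length numeric rows; returns a cluster index per row."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(cluster_labels, true_labels):
    """Fraction of samples that fall in the majority class of their cluster."""
    total = 0
    for c in set(cluster_labels):
        members = [t for cl, t in zip(cluster_labels, true_labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_labels)


def make_toy_data(n_per_type=30, n_signal=10, n_noise=30, seed=1):
    """Synthetic expression matrix (assumption): signal columns first, then noise."""
    rng = random.Random(seed)
    rows, types = [], []
    for t in range(3):  # three made-up tissue types
        for _ in range(n_per_type):
            signal = [rng.gauss(float(t), 1.0) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            rows.append(signal + noise)
            types.append(t)
    return rows, types


def select(rows, cols):
    """Keep only the given gene columns."""
    return [[row[j] for j in cols] for row in rows]


rows, types = make_toy_data()
candidate_cols = list(range(10))   # hypothetical stand-in for a landmark gene set
pool_cols = list(range(10, 40))    # hypothetical stand-in for non-landmark genes

candidate_purity = purity(kmeans(select(rows, candidate_cols), k=3), types)

# Score 20 random gene sets of the same size drawn from the pool.
rng = random.Random(2)
random_purities = [
    purity(kmeans(select(rows, rng.sample(pool_cols, 10)), k=3), types)
    for _ in range(20)
]
mean_random = sum(random_purities) / len(random_purities)

print(f"candidate set purity: {candidate_purity:.2f}")
print(f"mean purity over 20 random sets: {mean_random:.2f}")
```

Scoring many random sets of the same size, rather than a single one, mirrors the abstract's comparison against "any random set" of non-landmark genes and gives a distribution against which the candidate set's score can be judged.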