RNA Sequencing and Clinical Data Analysis of Multiple Cancer Types on the National Cancer Institute's Genomic Data Commons
Open Access
- Author:
- Clayman, Carly
- Graduate Program:
- Engineering Science
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- November 07, 2019
- Committee Members:
- Satish Mahadevan Srinivasan, Thesis Advisor/Co-Advisor
Raghu Sangwan, Committee Member
Youakim Badr, Committee Member
Colin Neill, Program Head/Chair
- Keywords:
- RNA Sequencing
Genomic Data Commons
TCGA
Prostate Cancer
Survival Analysis
Boruta Random Forest
Feature Selection
Clustering Analysis
Principal Component Analysis
Clinical Data
RNA-Seq
Cancer
Landmark Genes
- Abstract:
- Dimensionality reduction methods are used to select relevant features, and clustering performs well when applied to data with low effective dimensionality. This study used clustering to predict categorical response variables from Illumina HiSeq ribonucleic acid sequencing (RNA-Seq) data accessible on the National Cancer Institute Genomic Data Commons. The dimensionality of the dataset was reduced using several methods. One method selected genes for analysis using a set of landmark genes, which have previously been shown to predict expression of the remaining target genes with low error. Another method selected genes by mining relevant genes from the literature using the DisGeNET package in R. Groups within the dataset were characterized using clinical data to assess whether landmark genes would improve clustering results compared to established cancer-relevant genes from the literature. Cancer-relevant genes and landmark genes with the most significant correlations with the clinical outcome of overall survival were also assessed in Kaplan-Meier survival analysis. In addition, clinical variables, as well as the interaction of clinical variables and cancer-relevant genes, were assessed in survival analysis. While individual gene expression levels and clinical variables were significant predictors of overall survival when assessed separately, the combination of genes with clinical variable levels provided the most predictive power for overall survival. Important landmark genes selected by the Boruta random forest algorithm resulted in improved clustering performance consistent with high vs. low overall survival, compared to important disease-relevant genes. These findings indicate that multiple cancer types should be further assessed to determine which genes are relevant for cancer outcomes. This study has implications for assessing gene-gene interactions and gene-environment interactions for multiple cancer types.
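The Kaplan-Meier step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the abstract reports that R was used, and the abstract does not say how expression groups were formed. The function names `kaplan_meier` and `split_by_median`, the median cutoff, and the toy inputs are all assumptions made for illustration.

```python
from statistics import median


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up duration for each patient
    events : 1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct
    time where at least one death occurred.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= removed
    return curve


def split_by_median(expression, times, events):
    """Split patients into low/high expression groups.

    The median cutoff is an assumed, illustrative choice.
    """
    cutoff = median(expression)
    low = [(t, e) for x, t, e in zip(expression, times, events) if x <= cutoff]
    high = [(t, e) for x, t, e in zip(expression, times, events) if x > cutoff]
    return low, high


if __name__ == "__main__":
    # Made-up toy values, not data from the study.
    expression = [1.2, 3.4, 0.8, 2.9, 4.1, 0.5]
    times = [5, 8, 12, 12, 20, 25]
    events = [1, 0, 1, 1, 0, 1]
    low, high = split_by_median(expression, times, events)
    for label, group in (("low", low), ("high", high)):
        group_times = [t for t, _ in group]
        group_events = [e for _, e in group]
        print(label, kaplan_meier(group_times, group_events))
```

Comparing the two curves would typically be followed by a significance test such as the log-rank test, which this sketch omits.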