Computational methods for dissecting the genetic basis of neurodevelopmental disorders

Open Access
- Author:
- Manickavasagam Pounraja, Vijay Kumar
- Graduate Program:
- Bioinformatics and Genomics (PhD)
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 05, 2021
- Committee Members:
- Naomi Altman, Major Field Member
Vasant Honavar, Outside Field Member
Yifei Huang, Major Field Member
Santhosh Girirajan, Chair & Dissertation Advisor
Reka Albert, Outside Unit Member
George Perry, Program Head/Chair
- Keywords:
- exome-sequencing
CNV
machine learning
rare variant genetics
genomics
neurodevelopmental disorders
random forest
apriori algorithm
rare combinations
combinatorial analysis
- Abstract:
- The genetic basis of neurodevelopmental disorders such as autism and schizophrenia is complex. Unlike Mendelian disorders, which have causal relationships with one or a few genes, complex disorders result from the composite effects of a large number and wide variety of genetic variants and their interactions. Studies have implicated hundreds of variants in such disorders, ranging from single base-pair changes (SNPs) to large chromosomal deletions/duplications (copy-number variants or CNVs). Furthermore, the assortment of phenotypes manifested by the carriers of a specific variant can be highly heterogeneous. This many-to-many, non-deterministic relationship between genotypes and phenotypes makes it challenging, yet important, to precisely detect specific rare genetic variants and attribute their roles in specific phenotypes. In this dissertation, I present computational approaches that improve the current state of rare variant detection and interpretation, to facilitate reliable clinical diagnoses. The specific projects are outlined below.
Specific phenotypic effects of complex diseases can be effectively attributed to rare CNVs only if the CNVs can be reliably detected. However, current methods to detect CNVs using targeted exome sequencing data are prone to high false-positive rates, making the choice of method a major determinant of subsequent inferences in any given study. We show that the prevalent approach of taking the majority vote among multiple CNV prediction algorithms to select high-confidence CNVs is inadequate, and we propose an alternative binary machine-learning classifier to improve both reliability and yield.
Identification of clinically relevant genes/variants could become more targeted if the genetic basis of neurodevelopmental syndromes could be inferred indirectly from the assortment of phenotypes manifested in affected individuals. A recent study showed that carriers of mutations in the genes RAI1 and TCF20, which share similar protein domain composition, exhibit a similar assortment of phenotypes. Based on this observation, we hypothesize that similar relationships exist across the genome. However, given the combinatorial complexity involved in searching for more than two entities, we propose a generalized model in which genes with similar domain composition lead to similar sets of phenotypes.
While quantifying the contributions of specific rare variants towards complex diseases is a commonly studied problem, evaluating the contributions of rare variant combinations is not well explored. Furthermore, oligogenic models serve as useful frameworks to understand diseases, but actively deploying them to screen for rare variant combinations that influence specific phenotypes is difficult for two reasons. First, combinations of rare events are rarer still, so large sample sizes are needed to observe even a few recurrences of a combination in a cohort. Second, even when the sample size is large, no existing method can exhaustively analyze variant combinations without compromising the interpretability of results. We propose a framework that combines the apriori algorithm and binomial tests to efficiently identify variant combinations that are differentially enriched between two groups. We use this framework to analyze ~6,000 probands with intellectual disability and identify specific gene combinations that affect the phenotype. Our method can be extended to detect higher-order interactions and to address a wide range of problems involving rare event combinations, including screening for genes with similar protein domain composition that result in syndromes with similar sets of phenotypes.
Together, these computational methods contribute to improvements in variant interpretation, understanding of disease etiology, and clinical diagnoses.
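The idea of pairing the apriori algorithm with binomial tests can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`frequent_combinations`, `binomial_upper_tail`, `enriched_combinations`), the representation of each individual as a set of gene names, the add-one smoothing of the control frequency, and the thresholds are all assumptions made for illustration. The sketch first enumerates gene combinations that recur in cases, pruning any candidate that has an infrequent subset (the apriori property), and then applies a one-sided binomial test of the case count against the smoothed control rate.

```python
from itertools import combinations
from math import exp, lgamma, log, log1p


def frequent_combinations(samples, min_count, max_size=2):
    """Apriori-style search over samples (each a set of gene names).

    Returns {frozenset(genes): count} for every combination of up to
    max_size genes carried together by at least min_count samples.
    """
    counts = {}
    for sample in samples:
        for gene in sample:
            key = frozenset([gene])
            counts[key] = counts.get(key, 0) + 1
    current = {k: c for k, c in counts.items() if c >= min_count}
    result = dict(current)
    size = 1
    while current and size < max_size:
        size += 1
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Apriori pruning: keep a candidate only if every subset one
            # element smaller was itself frequent at the previous level.
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for s in samples if c <= s) for c in candidates}
        current = {k: c for k, c in counts.items() if c >= min_count}
        result.update(current)
    return result


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def enriched_combinations(cases, controls, min_count=2, max_size=2, alpha=0.05):
    """Combinations of two or more genes over-represented in cases.

    Assumption for this sketch: the null rate is the control frequency
    with add-one smoothing, so it is never exactly zero.
    """
    hits = []
    for combo, k in frequent_combinations(cases, min_count, max_size).items():
        if len(combo) < 2:
            continue
        in_controls = sum(1 for s in controls if combo <= s)
        p0 = (in_controls + 1) / (len(controls) + 2)
        p_value = binomial_upper_tail(k, len(cases), p0)
        if p_value < alpha:
            hits.append((combo, k, in_controls, p_value))
    return sorted(hits, key=lambda h: h[3])


if __name__ == "__main__":
    # Toy data with hypothetical gene labels, not real results.
    cases = [{"A", "B"}, {"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]
    controls = [{"A"}, {"B"}, {"C"}, {"A"}, {"B"}, {"C"}, {"A", "C"}, {"B"}]
    for combo, k, m, p in enriched_combinations(cases, controls):
        print(sorted(combo), k, m, round(p, 5))
```

The pruning step keeps the search tractable because a combination can only recur in a cohort if each of its sub-combinations does. A real analysis would also need a multiple-testing correction across all tested combinations, which this sketch omits.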