APPLICATIONS OF CLUSTERING ALGORITHMS TO ENTITY RESOLUTION AND HUMAN GENOME VARIATION

Open Access
- Author:
- Li, Weiling
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 05, 2018
- Committee Members:
- Raj Acharya, Dissertation Advisor/Co-Advisor
Paul Medvedev, Committee Chair/Co-Chair
Bhuvan Urgaonkar, Committee Member
Wang-Chien Lee, Committee Member
Mary Poss, Outside Member
- Keywords:
- data mining
clustering
entity resolution
crowdsourcing
polymorphic HERV-K
human genome variation
- Abstract:
- This thesis develops computational approaches for data analysis using clustering techniques. Clustering divides data into groups that capture the natural structure of the data. In the era of big data, clustering techniques play an important role in a wide variety of fields, including data mining, machine learning, the social sciences, and biology. This dissertation focuses on applications of clustering in two fields: crowdsourced entity resolution and virus sequence variation in human genomes. In the first part of the dissertation (Chapter 2), we propose crowdsourced clustering algorithms that integrate human effort into the clustering step of entity resolution. With the information explosion, entity resolution (ER), the task of disambiguating records that refer to the same entity across different data sources, has become increasingly important. To reduce time complexity while maintaining accuracy, blocking techniques are usually introduced into the ER process to group similar records into the same block. Some ER tasks are too complex for purely machine-based techniques, so human intelligence is integrated with machines. In our first publication, we proposed crowdsourced approaches to blocking based on basic clustering techniques, including k-medians and hierarchical clustering. We identify the fundamental human-powered operations needed to perform human-powered ER tasks. Binary and n-ary human intelligence tasks (HITs) were designed and compared in terms of cost. A feasibility study validates the two proposed human-powered blocking methods under the two HIT designs. The experimental results show that crowdsourced hierarchical blocking with n-ary HITs can reduce cost while retaining high accuracy. Because clustering algorithms are core to many data analyses, crowdsourced clustering approaches can be extended to other data tasks.
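As a rough illustration of blocking by hierarchical clustering, the sketch below merges records bottom-up until no two blocks are similar enough. It is not the dissertation's algorithm: the function names, the single-linkage rule, the threshold, and the token-level Jaccard similarity are all assumptions, and the similarity function only stands in for the answer a crowd worker would give to a binary "same entity?" HIT.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (hypothetical stand-in for a crowd answer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def hierarchical_blocks(records, threshold=0.5, similar=jaccard):
    """Single-linkage agglomerative clustering, stopped at `threshold`.

    Repeatedly merges the two most similar blocks until no pair of blocks
    has similarity >= threshold. Returns a list of blocks (lists of records).
    """
    blocks = [[r] for r in records]

    def linkage(x, y):
        # Single linkage: similarity of the closest pair across two blocks.
        return max(similar(a, b) for a in x for b in y)

    while len(blocks) > 1:
        # Find the pair of blocks with the highest linkage similarity.
        (i, j), best = max(
            (((i, j), linkage(blocks[i], blocks[j]))
             for i, j in combinations(range(len(blocks)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # i < j always holds, so deleting index j leaves block i in place.
        blocks[i].extend(blocks[j])
        del blocks[j]
    return blocks


# Toy product records, invented for illustration only.
records = [
    "Apple iPhone 6 16GB",
    "apple iphone 6 16gb gold",
    "Samsung Galaxy S5",
    "samsung galaxy s5 black",
]
blocks = hierarchical_blocks(records, threshold=0.5)
```

In a crowdsourced setting, each call to `similar` would correspond to a paid question, which is why the number of comparisons per merge matters for cost.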
The second and major portion of the dissertation (Chapter 3) focuses on developing a mining tool to detect polymorphism in human endogenous retroviruses. The HERV-K family is the youngest of the human endogenous retroviruses (HERVs) and the only group known to be insertionally polymorphic in humans. Viral transcripts, proteins, and antibodies to HERV-K proteins have been detected in many disease states, including cancers and autoimmune and neurodegenerative diseases. However, attempts to link polymorphic HERV-K with any disease have been frustrated, in part because comprehensive knowledge of how the population frequency of HERV-K status varies at each occupied site is lacking. There are also computational challenges in identifying HERV-K insertional polymorphism from short-read sequence data, which is needed to generate these frequency data. With the goal of producing a computationally robust and efficient tool to advance understanding of the role HERV-K could have in human disease, we developed a comprehensive approach, applicable to any short-read whole-genome data, that detects the status (absence, solo LTR, or allelic states of the provirus) of all known coding HERV-K in an individual. Our method identifies these states by estimating the proportion of k-mers from any whole-genome sequence data that match a set of k-mers unique to each HERV-K. We use the 1000 Genomes Project data to determine global prevalence by population and individual HERV-K burden, applying mixture model-based clustering to account for the low-depth sequence data characteristic of this data set. We demonstrate that the prevalence of polymorphic HERV-K varies widely among the five super-populations represented in the 1000 Genomes Project, with East Asian (EAS) and African (AFR) populations having the lowest and highest frequency, respectively, of polymorphic HERV-K. Our study identifies population-specific sequence variation for several HERV-K proviruses.
In addition, we determine that polymorphic HERV-K co-occur at different frequencies among populations, and we implement a visualization tool that depicts the prevalence of combinations of HERV-K in all populations represented in the 1000 Genomes Project.
In Chapter 4, we apply the k-mer-based approach proposed in Chapter 3 to two high-depth sequence datasets. First, we apply the approach to a small dataset of patients with a cancer of T cells. We discuss the differences in HERV-K prevalence and co-occurrence between patients and the general population, which suggest that HERV-K could contribute to this cancer. We also use long-insert mate-pair sequencing data to reconstruct HERV-K sequences for investigating HERV-K alleles. Then, we apply our k-mer approach to a large high-depth sequence dataset provided by the New York Genome Center to recognize patterns of allelic structure. This analysis validates the performance of our mixture model and confirms the allelic structure.
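The k-mer matching idea behind the Chapter 3 approach can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dissertation's tool: the function names are hypothetical, the sequences are invented toy strings rather than HERV-K references, reverse complements and sequencing errors are ignored, and the subsequent mixture model clustering step is not shown.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(references: dict, k: int) -> dict:
    """For each named reference, keep only k-mers found in no other reference."""
    all_sets = {name: kmers(seq, k) for name, seq in references.items()}
    unique = {}
    for name, ks in all_sets.items():
        others = set().union(*(s for n, s in all_sets.items() if n != name))
        unique[name] = ks - others
    return unique


def matched_proportion(reads, unique: dict, k: int) -> dict:
    """Fraction of each reference's unique k-mers observed in the reads."""
    observed = set()
    for read in reads:
        observed |= kmers(read, k)
    return {
        name: (len(ks & observed) / len(ks) if ks else 0.0)
        for name, ks in unique.items()
    }


# Toy reference sequences and reads, invented for illustration only.
references = {"siteA": "ACGTACGGA", "siteB": "TTGCAAGCT"}
reads = ["ACGTACG", "GCAAG"]
k = 4

unique = unique_kmers(references, k)
proportions = matched_proportion(reads, unique, k)
```

A proportion near 1 would suggest that the sequence is present in the sample and a proportion near 0 that it is absent; with low-depth data the values fall in between, which is the situation the abstract addresses with mixture model-based clustering.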