Probabilistic Techniques for Metagenomic Clustering and Intrusion Detection

Jha, Manjari

Open Access

Author:: Jha, Manjari
Graduate Program:: Electrical Engineering
Degree:: Doctor of Philosophy
Document Type:: Dissertation
Date of Defense:: June 13, 2018
Committee Members:: Raj Acharya, Dissertation Advisor/Co-Advisor
Kultegin Aydin, Committee Chair/Co-Chair
David Jonathan Miller, Committee Member
Vishal Monga, Committee Member
Mary Poss, Outside Member
Keywords:: metagenomics
intrusion detection
clustering
unsupervised
Abstract:: This thesis develops probabilistic models for analyzing diverse datasets with unsupervised clustering techniques. We focus on two fields: clustering DNA reads derived from a metagenomic sample, and separating attacks from normal connections in a collection of network connection records. Metagenomics is the analysis of the genomes of microorganisms sampled directly from their environment. Next-generation sequencing samples short segments of the genomes in a metagenome at high throughput to generate reads. To study the properties and relationships of the microorganisms present, reads from unknown species can be clustered by their inherent composition. We propose a two-dimensional lattice-based probabilistic model for clustering metagenomic datasets. The occurrence of a species in the metagenome is estimated using a lattice of probability distributions over short genomic sequences. One dimension of the lattice corresponds to distributions for different word sizes, the other to distributions over groups of words. When the probabilistic support for a species under the parameters of the current node is deemed insufficient, the lattice structure lets the node draw additional support from its neighbors. We test our algorithm on simulated metagenomic data containing bacterial species with known ground truth and observe more than 85% precision. We also evaluate it on an in vitro-simulated bacterial metagenome and on human patient data, using ground truth from BLAST, and show better clustering than other algorithms, even for short reads and varied abundance.
Second, we develop a model for identifying intrusions in a computer network, inspired by the defense mechanisms of the immune system. The immune system is built to defend an organism against both known and new attacks, and functions as an adaptive distributed defense system. Artificial Immune Systems abstract the structure of immune systems to incorporate memory, fault detection, and adaptive learning. We propose an immune-system-based real-time intrusion detection system using unsupervised clustering. The model consists of three layers: a probabilistic-model-based T-cell algorithm that identifies possible attacks, a B-cell model that uses the inputs from the T-cells together with feature information to confirm true attacks, and an Antigen Presenting Cell layer that generates damage signals. The algorithm is tested on the KDD 99 data, where it achieves a low false alarm rate while maintaining a high detection rate. This holds even for novel attacks, which is a significant improvement over other algorithms.
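
As a rough illustration of what "distributions over short genomic sequences" for a given word size can look like, the sketch below computes the relative frequency of every DNA word of length k in a read and compares two reads by the L1 distance between those distributions. This is not the thesis's lattice model: the function names `kmer_distribution` and `l1_distance`, the restriction to the alphabet ACGT, and the choice of L1 distance are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: words containing other symbols (e.g. N) are skipped


def kmer_distribution(read, k):
    """Return the relative frequency of every length-k DNA word in a read."""
    counts = Counter(
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if set(read[i:i + k]) <= set(ALPHABET)
    )
    total = sum(counts.values())
    words = ("".join(p) for p in product(ALPHABET, repeat=k))
    # Every one of the 4**k possible words gets an entry, so two
    # distributions built with the same k share the same keys.
    return {w: (counts[w] / total if total else 0.0) for w in words}


def l1_distance(p, q):
    """Sum of absolute differences between two distributions with the same keys."""
    return sum(abs(p[w] - q[w]) for w in p)


if __name__ == "__main__":
    a = kmer_distribution("ACGTACGTAC", 2)
    b = kmer_distribution("GGGGCCCCGG", 2)
    print(l1_distance(a, b))
```

Calling `kmer_distribution` with several values of k gives one distribution per word size, which is the simplest reading of "distributions for different word sizes"; how the thesis groups words and combines support across lattice nodes is not specified in this record.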
