L2-UniFrac

Open Access
- Author:
- Millward, Andrew
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 04, 2022
- Committee Members:
- Chitaranjan Das, Program Head/Chair
- David Koslicki, Thesis Advisor/Co-Advisor
- Paul Medvedev, Committee Member
- Keywords:
- UniFrac
- Bioinformatics
- Comparative Metagenomics
- Representative Microbiome
- Distance Metric
- Differential Abundance
- Abstract:
- In bioinformatics, UniFrac is a widely used metric for quantifying the differences between microbial samples. It assigns a "distance" between two samples based on experimental abundances and weights across a phylogenetic tree. Over the past two decades, UniFrac has gone through several key iterations, most of which focused on reducing the algorithm's execution time. Following the introduction of EMDUniFrac, a theoretical expression of EMDUniFrac in terms of L1-norm differences was proposed. This work re-expresses the EMDUniFrac algorithm in terms of the L2-norm and of L2-norm differences. It also shows that such a metric enables the computation of averages with respect to the L2-norm while maintaining natural constraints, which allows more efficient comparisons between microbial communities. We first outline, from prior work, the mathematical basis for introducing L2-normalization. Using this basis, we redefine the method for preprocessing vectors into L2-UniFrac space with respect to a shared phylogeny and the L2-norm, and we provide a method to reverse this process. We also provide a new algorithm that computes the standard L2-UniFrac metric directly, without preprocessing, together with a scheme that obtains the same value by first pushing vectors into L2-UniFrac space and then computing the L2-norm of their difference. Finally, we propose a process for computing the average of samples with respect to the L2-norm. We then describe several experiments that evaluate the efficacy of the L2-UniFrac metric.
These tests include verifying that PCoA and clustering performance on individual samples is similar to that of prior UniFrac metrics, testing for the presence of negative abundances, evaluating how closely related sample averages cluster under PCoA, analyzing the taxonomic structure of these averages both directly and with KronaTools, and applying differential abundance testing to the representative sample. The results show that L2-UniFrac performs very similarly to prior work such as EMDUniFrac on individual samples and substantially outperforms prior methods when applied to averaged, representative samples. L2-UniFrac yields a more natural representative sample for an entire microbiome and maintains key natural constraints that are violated when prior metrics are used for this purpose. Finally, the averaged vectors produced by L2-UniFrac provide a natural framework for analyzing specific OTU discrepancies between microbiomes, with results supported by prior literature.
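The abstract gives no formulas, so the following is only a minimal sketch of the pipeline it describes, not the thesis's implementation. It assumes that "pushing" a sample into L2-UniFrac space means replacing each node's abundance with the total abundance in its subtree, scaled by the square root of the length of the branch above it, so that the L2-norm of the difference of two pushed vectors equals the square root of the branch-length-weighted sum of squared subtree differences. The toy tree and the names `TREE`, `ORDER`, `push_up`, `inverse_push_up`, `l2_unifrac`, and `average` are hypothetical.

```python
import math

# Hypothetical toy phylogeny: node -> (parent, length of the branch to the parent).
TREE = {
    "A": ("X", 1.0),
    "B": ("X", 4.0),
    "C": ("root", 9.0),
    "X": ("root", 1.0),
}
# Post-order traversal (children before parents); the root has no branch above it.
ORDER = ["A", "B", "C", "X"]


def push_up(abund):
    """Assumed preprocessing: subtree mass on each branch times sqrt(branch length)."""
    mass = {node: abund.get(node, 0.0) for node in ORDER}
    mass["root"] = abund.get("root", 0.0)
    for node in ORDER:
        parent, _ = TREE[node]
        mass[parent] += mass[node]  # accumulate subtree totals upward
    return {node: mass[node] * math.sqrt(TREE[node][1]) for node in ORDER}


def inverse_push_up(pushed):
    """Reverse push_up: recover per-node abundances (any mass on the root is not recovered)."""
    mass = {node: pushed[node] / math.sqrt(TREE[node][1]) for node in ORDER}
    abund = dict(mass)
    for node in ORDER:
        parent, _ = TREE[node]
        if parent in abund:
            abund[parent] -= mass[node]  # remove what belongs to the children
    return abund


def l2_unifrac(p, q):
    """L2-norm of the difference of the two pushed-up vectors."""
    pushed_p, pushed_q = push_up(p), push_up(q)
    return math.sqrt(sum((pushed_p[n] - pushed_q[n]) ** 2 for n in ORDER))


def average(samples):
    """Average the samples in pushed-up space, then map the mean back to abundances."""
    pushed = [push_up(s) for s in samples]
    mean = {n: sum(v[n] for v in pushed) / len(pushed) for n in ORDER}
    return inverse_push_up(mean)
```

For example, with `p = {"A": 1.0}` and `q = {"C": 1.0}`, the three branches on the path between A and C (lengths 1, 1, and 9) each carry a subtree difference of 1, so `l2_unifrac(p, q)` equals `math.sqrt(11)`. `average([p, q])` places 0.5 on each of A and C and contains no negative entries, which is the kind of natural constraint the abstract refers to.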