Computational Modeling of Compositional and Relational Data Using Optimal Transport and Probabilistic Models

Open Access
Ye, Jianbo
Graduate Program:
Information Sciences and Technology
Doctor of Philosophy
Document Type:
Date of Defense:
February 16, 2018
Committee Members:
  • James Z. Wang, Dissertation Advisor
  • James Z. Wang, Committee Chair
  • C. Lee Giles, Committee Member
  • Zhenhui Jessie Li, Committee Member
  • Reginald B. Adams, Outside Member
  • Jia Li, Dissertation Advisor
  • Jia Li, Committee Chair
  • machine learning
  • optimal transport
  • optimization
Quantitative researchers often view our world as a large collection of data generated and organized by the structures and functions of society and technology. Those data are usually presented and accessed with hierarchies, compositions, and relations. Understanding the structures and functions behind such data requires using models and methods for specifically analyzing their associated data structures. One of the biggest challenges in achieving this goal is developing a principled data and model framework capable of meaningfully exploiting the structured knowledge of data. Those structures of data include compositional and relational patterns: multiple entities have to interact and group in order to make sense. Although the conventional vector-based data analysis pipelines have become the standard quantitative framework for many fields in sciences and technology, they are not directly applicable to and have several limitations for extracting knowledge from compositional and relational data. The goal of this thesis research is to introduce new mathematical models and computational methods for analyzing large-scale compositional and relational data, as well as to validate the models’ usefulness in solving real-world problems. We begin by introducing several backgrounds, including optimal transport, an old but refreshing topic in mathematics, and probabilistic graphical model, apopular tool in statistical modeling. Particularly, we explain how optimal transport relates to an important modeling concept, a.k.a. matching, in machine learning. Next, we present our work related to computational algorithms of those relational and structural models including a fast discrete distribution clustering method using Wasserstein barycenters, a simulated annealing-based inexact oracle for Wasserstein loss minimization, a Bregman ADMM-based oracle for Wasserstein geodesic classification, and a probabilistic multi-graph model for consensus analysis. Their computational complexities, numerical difficulties, scalability, and accuracy issues are discussed in depth. We apply those computational algorithms to several areas, such as document analysis and crowdsourcing, by treating data as relational quantities from a perspective that has not been fully studied in the literature. We will conclude by discussing challenges in developing suitable methods for compositional and relational data and review more recent work that addresses several past concerns.