An Architecture for Multimodal Information Extraction from Scholarly Documents

Open Access
Author:
Ray Choudhury, Sagnik
Graduate Program:
Information Sciences and Technology
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
June 26, 2017
Committee Members:
  • Clyde Lee Giles, Dissertation Advisor
  • Clyde Lee Giles, Committee Chair
  • Prasenjit Mitra, Committee Member
  • James Z Wang, Committee Member
  • Daniel Kifer, Outside Member
Keywords:
  • digital libraries
  • information extraction
  • semantic scholar
  • document elements
  • machine learning
  • image processing
  • vector graphics
Abstract:
A scholarly paper (journal article, conference proceeding) has both unstructured (text) and semi-structured (tables and figures) data sources. An experimental figure such as a line graph is generated from a data table that stores the results of an experiment. Typically that data table is not reported in the paper and hence cannot be queried directly. Similarly, a scholarly table reports the results of an experiment but is not structured enough to support anything more than a keyword query. This dissertation has two contributions. First, we show methods to reduce these semi-structured data sources to structured content that can support factoid queries such as "What is the best precision for the ImageNet classification task?" or "What is the best BLEU score for English to Arabic translation?"

For scholarly figures, we report an end-to-end system. We begin with a batch extractor that extracts all figures (including vector graphics) and associated metadata from a document with 81% and 87% accuracy. Next, we report image processing algorithms to detect compound figures with 82% accuracy and to classify non-compound figures as line graphs or bar charts with 84% average accuracy. We improve the accuracy of text extraction from raster graphics by 39% and show algorithms to classify the text inside the plots with an average accuracy of 90%. The majority of figures in computer science papers are embedded as vector graphics. While previous work has always extracted them as raster graphics, we show methods to extract them in a vector graphics format, which allows us to scalably separate curves in line graphs with 75% average accuracy. This reduces a line graph to the original data points from which it was generated, enabling the factoid queries.

We report a similar architecture for scholarly tables that can reduce the tables to data-based triples supporting similar queries. Finally, we show supervised methods to extract scholarly entities from the text of the paper. Specifically, we show that a non-sequential classifier that learns the informativeness of a phrase globally and a sequential classifier that learns it from the local context can be combined to improve the accuracy of the extraction.
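To make the idea of a factoid query concrete, the following is a minimal sketch of how a question such as "What is the best BLEU score for English to Arabic translation?" might be answered once figures and tables have been reduced to structured records. The `Triple` type, its field names, the `best` function, and all sample values are hypothetical illustrations and are not taken from the dissertation; the sketch also assumes that a higher metric value is better.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One structured fact recovered from a table or figure (hypothetical schema)."""
    task: str
    method: str
    metric: str
    value: float


# Placeholder values for illustration only; these are NOT results from the dissertation.
TRIPLES: List[Triple] = [
    Triple("en-ar translation", "method_a", "BLEU", 10.0),
    Triple("en-ar translation", "method_b", "BLEU", 12.5),
    Triple("en-ar translation", "method_b", "precision", 0.5),
    Triple("image classification", "method_c", "precision", 0.7),
]


def best(triples: List[Triple], task: str, metric: str) -> Optional[Triple]:
    """Return the triple with the highest value for the given task and metric.

    Assumes higher is better; returns None when nothing matches.
    """
    matches = [t for t in triples if t.task == task and t.metric == metric]
    return max(matches, key=lambda t: t.value) if matches else None


if __name__ == "__main__":
    answer = best(TRIPLES, "en-ar translation", "BLEU")
    if answer is not None:
        print(f"{answer.method}: {answer.metric} = {answer.value}")
```

A keyword search over the original paper could only locate the page on which "BLEU" appears, whereas a filter-and-maximize over structured records returns the specific value and the method that achieved it.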