Semantic Structuring of Scientific Information in Scholarly Documents

Restricted (Penn State Only)
Author:
Alzaidy, Rabah Aboulrahman
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
May 04, 2017
Committee Members:
  • C. Lee Giles, Dissertation Advisor
  • C. Lee Giles, Committee Chair
  • Jesse Barlow, Committee Member
  • Vasant Honavar, Committee Member
  • Bruce A. Desmarais, Outside Member
Keywords:
  • academic search engine
  • digital library
  • bar chart data extraction
  • knowledge base construction
  • automated chart understanding
  • taxonomy construction
  • hypernym detection
  • knowledge graph
Abstract:
The continuing growth of scholarly content published on the web gives researchers access to the most recent scientific findings. Extracting scientific information from these documents into a structured knowledge graph representation facilitates automated machine understanding of the documents. Knowledge graphs model information as entities that are semantically related. Thus, to restructure a scholarly document into such a representation, we must identify meaningful entities and relationships that capture the scientific facts within the articles as accurately as possible. In this thesis, we propose a suite of algorithms designed for the semantic structuring of scientific content in scholarly documents. The thesis addresses two main areas of this problem. The first is concerned with algorithms capable of automatically understanding scientific charts in documents. These charts play an important role in scholarly documents because their content, more often than not, contains key facts that are not mentioned elsewhere in the document text. Scientific charts are an effective tool for visualizing numerical data. They appear in a wide range of contexts, from experimental results in scientific papers to statistical analyses in business reports. The abundance of scientific charts on the web has made it inevitable for search engines to include them as indexed content. However, when search engines rely solely on metadata tags to understand the charts, the facts represented in the charts are not fully available to information retrieval tools. Most applications, such as search indexing, describe these charts using image metadata rather than the information the graphic was originally designed to display. Many studies address the extraction of data from scientific diagrams in order to improve search results.
In particular, the problem of understanding digital charts found in scholarly documents and inferring useful textual information from their graphical components is the focus of numerous studies. In our approach, we attempt to enhance the semantic labelling of scientific charts by using the original data values that these charts were designed to represent. We describe a framework that automatically reads chart data, specifically from bar charts, and provides the user with a textual summary of the chart. The chart reading process is fully automated: it combines image processing and text recognition techniques with heuristics derived from the graphical properties of bar charts to extract the original data values. The proposed framework follows a knowledge discovery approach that relies on a versatile graph representation of the chart. This representation is derived by analyzing a chart's original data values, from which useful features are extracted. The data features are in turn used to construct a semantic graph. To illustrate the portability of the semantic graph structure, we apply it to a common natural language application, summary generation. To generate a summary, the semantic graph of the chart is mapped to appropriately crafted protoforms, which are linguistic constructs based on fuzzy logic. We verify the effectiveness of our framework by conducting experiments on bar charts extracted from over 1,000 PDF documents. Our preliminary results show that, under certain assumptions, 83% of the produced summaries provide plausible descriptions of the bar charts. The second focus area of this thesis is automatically understanding a document's scientific text itself. Specifically, we address the problem of constructing a knowledge graph from a set of scholarly documents containing a large number of concepts that appear in scientific research content but cannot be found in a general-purpose knowledge base.
Traditional information extraction approaches, which require either training samples or a pre-existing knowledge base to assist in the extraction, can be challenging to apply to such repositories. Labeled training examples are difficult to obtain for datasets of this scale. Also, most available knowledge bases are built from web data and do not have sufficient coverage of the concepts found in scientific articles. In this thesis, we aim to construct a knowledge graph from scholarly documents while addressing both of these issues. We propose a fully automatic, unsupervised system for scientific information extraction that does not build on an existing knowledge base and avoids manually tagged training data. We describe and evaluate a knowledge graph constructed by applying our approach to 10k documents.
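As a rough illustration of how a fuzzy-logic protoform can turn extracted bar values into a sentence, the following minimal Python sketch evaluates the linguistic summary "Most bars are high". It is not the thesis's implementation: the protoform chosen, the membership breakpoints, and the names ramp, HIGH, MOST, truth_of_most_high, and summarize are all assumptions made for this example.

```python
from typing import Callable, Dict, List


def ramp(lo: float, hi: float) -> Callable[[float], float]:
    """Return a membership function rising linearly from 0 at lo to 1 at hi."""
    def mu(x: float) -> float:
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        return (x - lo) / (hi - lo)
    return mu


# Hypothetical fuzzy sets; the breakpoints are illustrative assumptions.
HIGH = ramp(0.5, 0.8)  # summarizer "high", over bar values scaled to [0, 1]
MOST = ramp(0.3, 0.8)  # quantifier "most", over a proportion in [0, 1]


def truth_of_most_high(values: List[float]) -> float:
    """Truth degree of the protoform 'Most bars are high'."""
    top = max(values)
    if top <= 0:
        raise ValueError("expected at least one positive bar value")
    # Scale each bar by the tallest bar so memberships are comparable.
    scaled = [v / top for v in values]
    # Fuzzy proportion of bars that count as "high".
    proportion = sum(HIGH(v) for v in scaled) / len(scaled)
    return MOST(proportion)


def summarize(bars: Dict[str, float]) -> str:
    """Fill a fixed sentence template with the computed truth degree."""
    truth = truth_of_most_high(list(bars.values()))
    tallest = max(bars, key=lambda label: bars[label])
    return (f"Most bars are high (truth {truth:.2f}); "
            f"the tallest bar is '{tallest}'.")


if __name__ == "__main__":
    print(summarize({"A": 10.0, "B": 9.0, "C": 8.0, "D": 2.0}))
```

The truth degree lets a summarizer rank candidate sentences and keep only those that describe the chart well. A fuller system would evaluate several protoforms (for example, trends or comparisons between bars) rather than this single template.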