A Methodology of Machine Learning in Automated Entity Summarization

Open Access
Author:
Chonde, Seifu John
Graduate Program:
Industrial Engineering
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
November 20, 2015
Committee Members:
  • Soundar Kumara, Dissertation Advisor
  • Karl Todd Mueller, Committee Member
  • Joey W Storer, Special Member
  • Conrad S Tucker, Committee Member
  • Paul M. Griffin, Committee Member
  • Vasant Gajanan Honavar, Committee Member
Keywords:
  • text mining
  • network science
  • diversity
  • science of science
  • entity summarization
Abstract:
Conducting background research is a time-consuming yet important part of every research endeavor. It involves compiling relevant sources, reading them, and comprehending the information they contain. The volume of this information grows rapidly in the current information age. Automated text summarization, among other techniques (e.g., search engines), improves the efficiency of exploring data by distilling the large amounts of information that are becoming prevalent. This dissertation presents a methodology of automatic entity summarization for summarizing entity and topic interaction in large information stores. The methodology is broken into three steps: Reading, Assembly, and Interpretation. In the Reading step, the appropriate information sources are determined and, subsequently, the interrelated entities are extracted within each source. Four inputs are necessary in this step: a topic extraction algorithm, a named entity recognition algorithm, information sources, and property information for the entities. In the Assembly step, the relationships between entities across sources are represented through knowledge networks. A trimodal weighted co-occurrence hypergraph is presented and then projected into unimodal and bimodal graphs. Finally, in the Interpretation step, graph analytics are presented to summarize the graphs. A novel diversity heuristic is derived based on information entropy to compare information diversity in different streams of literature over time. To test the methodology, three experiments were conducted. Data from the PubMed Central Open Access Subset, which consisted of 740,418 journal citations in 4,404 journals, were downloaded on July 14, 2014. The first experiment addressed the relationship between the size of the information network and the number of files input into the methodology. It was found that a power law relationship exists, as has been shown in linguistic theory.
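The Assembly step described above can be illustrated with a small sketch. This is not the dissertation's implementation: it assumes, purely for illustration, that the three modes of the hypergraph are topics, entities, and entity properties, that a hyperedge's weight is the number of sources containing that triple, and that a unimodal projection counts the sources in which two nodes of the same mode co-occur. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_hypergraph(sources):
    """Weight each (topic, entity, property) hyperedge by the number of
    sources it appears in (assumed weighting, for illustration only)."""
    weights = Counter()
    for triples in sources.values():
        weights.update(set(triples))  # count each triple once per source
    return weights


def project_bimodal(hypergraph, i, j):
    """Project onto two modes by summing hyperedge weights over the third."""
    projection = Counter()
    for edge, weight in hypergraph.items():
        projection[(edge[i], edge[j])] += weight
    return projection


def project_unimodal(sources, position):
    """Count the sources in which two nodes of the same mode co-occur."""
    projection = Counter()
    for triples in sources.values():
        nodes = sorted({triple[position] for triple in triples})
        projection.update(combinations(nodes, 2))
    return projection


# Hypothetical toy corpus: two sources, each a list of triples.
sources = {
    "doc1": [("diabetes", "insulin", "hormone"),
             ("diabetes", "glucose", "sugar")],
    "doc2": [("diabetes", "insulin", "hormone"),
             ("obesity", "glucose", "sugar")],
}

hypergraph = build_hypergraph(sources)
topic_entity = project_bimodal(hypergraph, 0, 1)
entity_entity = project_unimodal(sources, 1)
```

In this toy corpus the insulin and glucose entities co-occur in both sources, so their edge in the entity-only projection carries the highest weight.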
The second experiment addressed the validity of the methodology in extracting meaningful connections and predicting the top chemicals using two gold standards. Results indicate that the methodology can be used to determine the top chemicals and that meaningful connections are those with the highest weight in the network. Finally, the diversity heuristic was used in the third experiment to empirically compare the diversity of information in a stream of articles relating to honeybee research with that in a stream of articles relating to diabetes research. The existing heuristic was found to give quite noisy results when applied to information networks, whereas the new heuristic has better asymptotic properties. This research is among the first efforts towards building improved literature-based discovery algorithms that are capable of automating the hypothesis generation process in large literature sets.
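The abstract does not define the novel diversity heuristic, so the sketch below uses plain Shannon entropy as a generic stand-in to show what an entropy-based diversity measure tracked over a stream of articles might look like. It should not be read as the author's heuristic. The function names and the toy stream are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def diversity_over_time(stream):
    """Entropy of the cumulative entity counts after each article."""
    seen = Counter()
    history = []
    for entities in stream:
        seen.update(entities)
        history.append(shannon_entropy(seen.values()))
    return history


# Hypothetical stream: each article is a list of extracted entity mentions.
stream = [["a"], ["b"], ["a", "b"]]
history = diversity_over_time(stream)
```

Comparing two literature streams would then amount to computing one such history per stream and examining how each curve behaves as more articles arrive.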