AUTHOR NAME DISAMBIGUATION AND CROSS SOURCE DOCUMENT COREFERENCE

Open Access
Author:
Dang, Ke
Graduate Program:
Information Sciences and Technology
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
None
Committee Members:
  • C Lee Giles, Thesis Advisor
Keywords:
  • digital library
  • author name disambiguation
  • cross-source document coreference
  • random forests
Abstract:
This thesis addresses two research problems: author name disambiguation in digital libraries and cross-source document coreference. The first problem arises in digital libraries, which are important tools for maintaining the information their users rely on. Because author names are ambiguous, users cannot always identify the exact authors of the articles in a digital library. The ambiguity has two main sources: polysemes, where one author name is shared by multiple authors, and synonyms, where one author appears under multiple name variants. Resolving ambiguous author names improves search quality when a user looks for a specific author, which happens frequently in digital libraries. It also improves the accuracy of statistics computed from an author's publications, such as the author's reputation. In this thesis, we present a comprehensive, synthesized summary of author name disambiguation algorithms. We also survey the evaluation datasets and metrics used in the surveyed papers and, based on the survey, suggest several possible directions and ideas for future work. The second problem, cross-source document coreference, is a new and important research direction. It deals with linking the entities mentioned in documents from one source to their corresponding identities, if they exist, in another source. For example, the first source, called the general source, can be the World Wide Web, and the second source, called the canonical source, can be Wikipedia. Successful cross-source document coreference benefits many scenarios. We can automatically construct and enrich the information about an entity in one source from the information about it in another source. Search results for entities in one source can also be grouped by their identities in the other sources.
Furthermore, we can compute the reputation of an entity discussed in one source from the information in other sources. In this thesis, we use an information extraction tool, OpenCalais, and develop a large set of 88 features to support cross-source document coreference. We also apply a state-of-the-art machine learning algorithm, random forests, to this new area. In the experiments, we compare the random forests model with three traditional models: Decision Tree, Naïve Bayes, and Bayes Network. The experimental results demonstrate that the random forests model significantly outperforms Naïve Bayes and Bayes Network by 10.67% and 12.65%, respectively. The random forests model also outperforms the Decision Tree by 2.88%.
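
The two sources of author name ambiguity described in the abstract can be illustrated with a minimal sketch. It is not the method of the thesis: it assumes a simple baseline that groups records by last name and first initial, and the function names (`name_key`, `block_by_name`) and the sample records are hypothetical, invented only for illustration.

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Last, Given' or 'Given Last' to (last name, first initial).

    Hypothetical helper: a crude normalization, assumed for illustration.
    """
    name = name.strip()
    if "," in name:
        # 'Wang, W.' style: the last name comes before the comma.
        last, _, given = name.partition(",")
    else:
        # 'Wei Wang' style: the last token is taken as the last name.
        parts = name.split()
        last, given = parts[-1], " ".join(parts[:-1])
    given = given.strip()
    initial = given[0].lower() if given else ""
    return (last.strip().lower(), initial)


def block_by_name(records):
    """Group record ids whose author names share the same normalized key."""
    blocks = defaultdict(list)
    for record_id, name in records:
        blocks[name_key(name)].append(record_id)
    return dict(blocks)


# Invented sample records: (record id, author name as printed).
records = [
    ("p1", "Wei Wang"),
    ("p2", "Wang, W."),
    ("p3", "W. Wang"),
    ("p4", "Jane A. Smith"),
]
blocks = block_by_name(records)
```

Grouping by key merges the name variants of one author (the synonym case), but it also places different people who share a name in the same block (the polyseme case). A key of this kind can therefore only produce candidate groups; separating the authors inside a block needs further evidence such as coauthors, venues, or topics.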