Large Scale Author Name Disambiguation in Scholarly Databases
Open Access
- Author:
- Menon, Arjun
- Graduate Program:
- Computer Science
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- March 15, 2021
- Committee Members:
- C Lee Giles, Thesis Advisor/Co-Advisor
- Bhuvan Urgaonkar, Committee Member
- Chitaranjan Das, Program Head/Chair
- Keywords:
- Author Name Disambiguation
- Machine Learning
- Scholarly Database
- Clustering
- Distributed System
- Abstract:
- Author name ambiguity in a digital library can affect the correctness of the records associated with a person and invalidate the findings of research based on automated mining of the library's data. Author Name Disambiguation (AND) is the problem of correctly linking individuals to their corpus of work. In the context of an academic digital library, the problem can be described as follows: two or more authors may share the same name (homonyms), causing the library to return query results that are not pertinent to the target author. Additionally, an author may have two or more name variants (synonyms) for reasons including, but not limited to, marriage-induced name changes or inconsistent use of middle names. The problem presents itself in two primary ways: an individual may be identified as two or more distinct authors (thus presenting the need for splitting or record disambiguation), or two or more authors may be identified as a single author (thus presenting the need for merging or record conflation). A number of automated, unsupervised name disambiguation solutions exist, based on the pairwise similarities of bibliographic candidate records and co-authorship patterns. We implement AND for a live scholarly database, with an implementation capable of processing records at scale in both batch and online modes. We present an end-to-end solution that performs blocking of authors to process at scale, computes similarity feature vectors for classification, and provides a clustering mechanism to merge records. A key contribution of the work is an asynchronous workflow that does not compromise consistency, together with the ability to extensively configure system resources and set limits so that the primary functions of the scholarly database are not affected by disambiguation workloads. The performance of the system is qualitatively evaluated and the findings are discussed.
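The three stages named in the abstract (blocking, pairwise similarity, clustering) can be illustrated with a minimal sketch. This is not the thesis implementation: the blocking key, the record fields (`name`, `coauthors`, `keywords`), the averaged Jaccard score standing in for a trained classifier, the `0.3` threshold, and the union-find merge are all assumptions chosen to keep the example small, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    last, _, first = name.partition(",")
    return f"{last.strip().lower()}_{first.strip()[:1].lower()}"


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(r1, r2):
    """Stand-in for a pairwise classifier: mean of two overlap features."""
    return 0.5 * jaccard(r1["coauthors"], r2["coauthors"]) + \
           0.5 * jaccard(r1["keywords"], r2["keywords"])


def disambiguate(records, threshold=0.3):
    """Return clusters of record indices judged to be the same author."""
    # Blocking: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[block_key(rec["name"])].append(i)

    # Union-find over record indices (assumed clustering mechanism).
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Invented sample records, for illustration only.
records = [
    {"name": "Smith, John", "coauthors": ["Wu"], "keywords": ["clustering"]},
    {"name": "Smith, J.", "coauthors": ["Wu", "Chen"], "keywords": ["clustering", "blocking"]},
    {"name": "Smith, Jane", "coauthors": ["Rao"], "keywords": ["genomics"]},
    {"name": "Lee, Kim", "coauthors": ["Wu"], "keywords": ["clustering"]},
]
print(disambiguate(records))
```

Blocking keeps the number of pairwise comparisons manageable: the fourth record is never compared with the first three because it falls in a different block, even though its co-authors and keywords overlap with theirs.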