Name Disambiguation in Academic Publications

Open Access
- Author:
- Treeratpituk, Pucktada
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Master of Science
- Document Type:
- Master Thesis
- Date of Defense:
- May 19, 2009
- Committee Members:
- C Lee Giles, Thesis Advisor/Co-Advisor
- Prasenjit Mitra, Thesis Advisor/Co-Advisor
- Dongwon Lee, Thesis Advisor/Co-Advisor
- Madhu Reddy, Thesis Advisor/Co-Advisor
- Keywords:
- digital libraries
- machine learning
- entity resolution
- Abstract:
- In digital libraries, author ambiguity arises when an author uses multiple aliases and when more than one author shares the same name. Since a substantial number of queries in digital libraries are author related, such ambiguity can be inconvenient for users. Without author disambiguation, users must manually go through the search results when they want to find all the articles written by a particular author. Author disambiguation also enables better bibliometric analysis by allowing more accurate counting and grouping of publications and citations. While many disambiguation algorithms have been proposed, the most successful ones are those that apply machine learning techniques, such as decision trees and SVMs, to learn domain-specific rules. While SVM-based disambiguation methods have been shown to work well, they suffer from typical drawbacks of SVMs, such as long training times and the arbitrary choice of kernel function. In this thesis, we propose a comprehensive set of similarity profile features to assist in author disambiguation and a novel pairwise author disambiguation algorithm based on random forests, an ensemble classifier built from decision trees. Our experiments on the Medline and CiteSeer databases show that our random forest method outperforms previously proposed techniques, including the SVM-based approaches. Compared with SVMs, the random forest model is substantially faster to train and requires less parameter tuning to achieve good performance. We also provide a detailed analysis of the interactions between different features and the prediction accuracy in the two databases. Finally, we demonstrate how feature selection can be applied to reduce the complexity of the model with little degradation in disambiguation accuracy.
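To make the idea of a pairwise similarity profile concrete, the following is a minimal Python sketch of how two author records might be turned into a feature vector. It is an illustration under stated assumptions, not the thesis's implementation: the record fields (`first`, `last`, `coauthors`, `title`), the helper names `jaccard` and `similarity_profile`, the sample records, and the five features are all hypothetical, and the thesis's actual feature set is not given in this abstract. In a pairwise approach, vectors like these, labeled as "same author" or "different author", would then be used to train a classifier such as a random forest.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_profile(rec1, rec2):
    """Build a feature vector describing how similar two author records are.

    The features are illustrative assumptions, not the thesis's feature set.
    """
    first1, first2 = rec1["first"].lower(), rec2["first"].lower()
    return {
        # Exact match on the last name (case-insensitive).
        "last_name_equal": float(rec1["last"].lower() == rec2["last"].lower()),
        # Match on the first initial only.
        "first_initial_equal": float(first1[:1] == first2[:1]),
        # One first name is a prefix of the other, e.g. "J" vs "John".
        "first_name_compatible": float(
            first1.startswith(first2) or first2.startswith(first1)
        ),
        # Overlap between the two coauthor lists.
        "coauthor_jaccard": jaccard(rec1["coauthors"], rec2["coauthors"]),
        # Overlap between the words of the two titles.
        "title_word_jaccard": jaccard(
            rec1["title"].lower().split(), rec2["title"].lower().split()
        ),
    }


# Two hypothetical author records that may or may not refer to the same person.
record_a = {
    "first": "J",
    "last": "Smith",
    "coauthors": ["A. Kumar", "B. Chen"],
    "title": "Learning to rank documents",
}
record_b = {
    "first": "John",
    "last": "Smith",
    "coauthors": ["B. Chen", "D. Park"],
    "title": "Ranking documents in digital libraries",
}

profile = similarity_profile(record_a, record_b)
for name, value in profile.items():
    print(f"{name}: {value:.3f}")
```

Each feature is a number between 0 and 1, so a tree-based classifier can consume the vector without further scaling. Keeping the features as a named dictionary also makes it straightforward to drop individual features when experimenting with feature selection.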