Person Name Disambiguation in the Multicultural and Online Setting

Open Access
Author:
Treeratpituk, Pucktada
Graduate Program:
Information Sciences and Technology
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
April 27, 2012
Committee Members:
  • C Lee Giles, Committee Chair
  • Prasenjit Mitra, Committee Member
  • James Z Wang, Committee Member
  • Daniel Kifer, Committee Member
  • C Lee Giles, Dissertation Advisor
Keywords:
  • name disambiguation
  • clustering
  • ethnicity classification
Abstract:
With the recent rise in popularity of social network sites, more and more personal information is becoming available online. Since a person’s information is generally available in various formats across multiple sites, there is ever-increasing interest in consolidating such personal information from multiple information sources. The goal of person name disambiguation is to group these person references according to the real-world people they denote. These references can range from personal homepages to names mentioned in news articles. This dissertation examines the person name disambiguation problem in three different settings: (1) name-based person name disambiguation, (2) metadata-based person name disambiguation, and (3) person name disambiguation in an online setting. In the simplest setting, name-based person name disambiguation, records are disambiguated based purely on personal names. Since personal names are culture-dependent, we propose a novel name matching similarity that takes the ethnicity of the names into consideration. More specifically, we propose a name-ethnicity classifier based on multinomial logistic regression and an ethnicity-sensitive name matching similarity based on the Smith–Waterman alignment algorithm, where different cost matrices are applied depending on the ethnicity of the names being compared. In the second setting, we examine the person name disambiguation problem where additional information other than personal names is also available. This additional information includes both association information, such as one’s affiliation and social network, and contextual information, such as the content of the document where one’s name is mentioned. We propose a random forest-based method for aggregating multiple types of metadata information to determine whether two or more person name records should be linked.
In the last setting, we consider the person name disambiguation problem from a real-system perspective, where the number of person references to be disambiguated is not static but ever increasing. Here we propose an online clustering method with constraints for person name disambiguation, where the integrity of each person cluster is continuously enforced. Our experiment shows that our method outperforms the previous static clustering approach without constraints.
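To illustrate the idea of an ethnicity-sensitive name similarity described in the abstract, the following is a minimal sketch of Smith–Waterman local alignment in which the substitution scores are selected by an ethnicity label. It is not the dissertation's implementation: the labels `default` and `lenient_vowel`, the score values, the gap penalty, and the normalization are all hypothetical choices made for illustration, and the label is assumed to come from some separate name-ethnicity classifier.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]


def default_score(a: str, b: str) -> float:
    """Hypothetical baseline substitution score: reward matches only."""
    return 2.0 if a == b else -1.0


def lenient_vowel_score(a: str, b: str) -> float:
    """Hypothetical score that treats vowel-for-vowel swaps as near matches,
    e.g. for names that have many transliteration variants."""
    vowels = "aeiou"
    if a == b:
        return 2.0
    if a in vowels and b in vowels:
        return 1.0
    return -1.0


# Assumed mapping from an ethnicity label to its substitution scores.
SCORERS: Dict[str, Scorer] = {
    "default": default_score,
    "lenient_vowel": lenient_vowel_score,
}


def smith_waterman(s: str, t: str, score: Scorer, gap: float = -1.0) -> float:
    """Return the best local alignment score between s and t."""
    s, t = s.lower(), t.lower()
    prev = [0.0] * (len(t) + 1)
    best = 0.0
    for i in range(1, len(s) + 1):
        curr = [0.0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            curr[j] = max(
                0.0,                                      # start a new local alignment
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # match or substitution
                prev[j] + gap,                            # gap in t
                curr[j - 1] + gap,                        # gap in s
            )
            best = max(best, curr[j])
        prev = curr
    return best


def name_similarity(s: str, t: str, ethnicity: str = "default") -> float:
    """Alignment score normalized by the best score the shorter name could reach."""
    scorer = SCORERS.get(ethnicity, default_score)
    denom = 2.0 * min(len(s), len(t))  # 2.0 is the match score in both scorers
    return smith_waterman(s, t, scorer) / denom if denom else 0.0


if __name__ == "__main__":
    print(name_similarity("Mohamed", "Muhamad", "default"))
    print(name_similarity("Mohamed", "Muhamad", "lenient_vowel"))
```

Under these assumed scores, the pair "Mohamed" and "Muhamad" is rated as more similar with the `lenient_vowel` scorer than with the baseline, which is the kind of behavior an ethnicity-dependent cost matrix is meant to provide.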