Entity Resolution for Large-Scale Databases

Restricted (Penn State Only)
Kim, Kunho
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
May 30, 2019
Committee Members:
  • Clyde Lee Giles, Dissertation Advisor
  • Clyde Lee Giles, Committee Chair
  • Daniel Kifer, Committee Member
  • Rebecca Jane Passonneau, Committee Member
  • Guido Cervone, Outside Member
Keywords:
  • Entity Resolution
  • Author Name Disambiguation
  • Record Linkage
  • Pairwise Classification
  • Clustering
Abstract:
Entity resolution is the problem of identifying, matching, and grouping records that refer to the same entity within a single data collection or across multiple collections. Real-world databases often combine data from multiple sources; hence, entity resolution is an essential preprocessing step for correctly answering queries about a particular entity. An example is finding a person's medical records across the records of multiple hospitals. Two main problems commonly arise in entity resolution. One is disambiguation (or deduplication), which involves clustering records that correspond to the same entity within a database. The other is record linkage, which involves matching records across multiple databases. In this dissertation, we study several aspects of entity resolution on large-scale structured data such as CiteSeerX, PubMed, and the United States Patent and Trademark Office (USPTO) patent database. First, we review our proposed entity resolution framework and discuss how to apply it to two practical problems: inventor name disambiguation on the USPTO patent database and financial entity record linkage. Second, we investigate building a web service that makes entity resolution results easier to use in several scenarios. We define two types of queries, attribute-based and record-based, and discuss how we design the web service to handle them efficiently. We demonstrate that our algorithm accelerates record-based queries by a factor of 4.01 compared to a naive baseline approach. Third, we discuss improving entity resolution in two directions. One direction is to improve the blocking method, which reduces unnecessary comparisons and thereby improves scalability on author name disambiguation problems. We show that our proposed conjunctive normal form (CNF) blocking, tested on the entire PubMed database of 80 million author mentions, efficiently removes 82.17% of all author record pairs. The other direction is to improve accuracy: we study enhancing pairwise classification, which estimates the probability that a pair of records refers to the same entity. Our proposed hybrid method, which uses both structure-aware and global features, improves mean average precision by up to 7.45 percentage points. Finally, we discuss entity and attribute extraction. Entity extraction is important for improving the quality of the input data for entity resolution and can also be used to extract useful entities from external sources. In this dissertation, we study the problem of extracting entities for task-oriented spoken language understanding in human-to-human conversation scenarios. Our proposed bidirectional LSTM architecture, with supplemental knowledge extracted from web data, search engine query logs, prior sentences, and task transfer, demonstrates an improvement in F1-score of up to 2.92% compared to existing approaches.
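To make the idea of CNF blocking more concrete, the following is a minimal illustrative sketch, not the dissertation's actual algorithm. It assumes a blocking rule written as a conjunction of clauses, where each clause is a disjunction of simple pairwise predicates, and keeps only the record pairs that satisfy the rule. The record fields (`first`, `last`, `affil`), the predicates, and the rule itself are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical author mentions; the field names are assumptions for this sketch.
records = [
    {"first": "Wei",   "last": "Chen", "affil": "Univ A"},
    {"first": "W.",    "last": "Chen", "affil": "Univ B"},
    {"first": "Li",    "last": "Chen", "affil": "Univ A"},
    {"first": "Wendy", "last": "Chan", "affil": "Univ A"},
]

# Simple pairwise predicates (hypothetical examples).
def same_last_name(a, b):
    return a["last"].lower() == b["last"].lower()

def same_first_initial(a, b):
    return a["first"][:1].lower() == b["first"][:1].lower()

def same_affiliation(a, b):
    return a["affil"].lower() == b["affil"].lower()

# CNF rule: (same_last_name) AND (same_first_initial OR same_affiliation).
# The outer list is the conjunction; each inner list is one disjunctive clause.
CNF_RULE = [
    [same_last_name],
    [same_first_initial, same_affiliation],
]

def cnf_match(a, b, cnf):
    """Return True if the pair satisfies every clause (any predicate per clause)."""
    return all(any(pred(a, b) for pred in clause) for clause in cnf)

def candidate_pairs(recs, cnf):
    """Return index pairs that survive blocking.

    This sketch enumerates all pairs for clarity; a scalable system would
    use indexes on the blocking keys instead of a quadratic scan.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(recs), 2)
        if cnf_match(a, b, cnf)
    ]

pairs = candidate_pairs(records, CNF_RULE)
total = len(records) * (len(records) - 1) // 2
print(pairs)                                   # surviving candidate pairs
print(f"removed {total - len(pairs)} of {total} pairs")
```

In this toy example, only the surviving pairs would be passed on to the more expensive pairwise classifier; the fraction of pairs removed is the kind of quantity the abstract reports for the full PubMed data.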