High Performance Record Linkage

Open Access
Author:
Kim, Hung-sik
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
April 26, 2010
Committee Members:
  • Dongwon Lee, Dissertation Advisor
  • Dongwon Lee, Committee Chair
  • Padma Raghavan, Committee Member
  • Bhuvan Urgaonkar, Committee Member
  • Ae Ja Yee, Committee Member
Keywords:
  • record linkage
  • parallel computing
  • information retrieval
  • data cleaning
  • copy detection
Abstract:
In current world, the immense size of a data set makes problems in finding similar/identitcal data. In addition, the dirtiness of data, i.e. typos, missing/tilting information, and additional noises usually occurred by careless editing or entry mistakes, makes further difficulty to identify entity-belongs. Therefore, we focus on the faster detection of data referring the same real-world entity from a large size data set under the error prone environments, while the high accuracy of detection is maintained. In this thesis, we study high-performance linkage algorithms using four different applications. First, we introduce the image linkage algorithm to find near-duplicate images with similar characteristics by bridging two seemingly unrelated fields – Multimedia Information Retrieval and Biology. Under this idea, we study how various image features and gene sequence generation methods affect the accuracy and performance of detecting near-duplicate images. Second, we develop the video linkage algorithm using record linkage methods to detect copied videos from a large multi-media database or sites such as YouTube and Yahoo Videos. The utilization of video characteristics is reflected to the hierarchical structure of the proposed algorithms. In addition, the uses of pipe-lined linkage structures accelerate the speed further. Third, the parallel linkage algorithm, the parallelization of the data linkage frame, is introduced, when slow but optimal sequential linkage frames occur where iterative matching operations apply to clean and merge dirty sets. Any data matching functions can be adapted to the proposed parallel framework because a data linkage function is considered as a black box in the parallel scheme. Finally, we introduce a hashed linkage structure based on the locality sensitive hashing (LSH) algorithm. By remedying the poverty of a basic LSH structure to suit linkage problems, the proposed hashing structure reduces the precessing time tremendously comparing to the conventional LSH structures.