The Migration of Data and Refactoring of Large Scale Digital Libraries: A Case Study For CiteSeerX
Open Access
Author:
Parsons, Sean
Graduate Program:
Information Sciences and Technology
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 26, 2020
Committee Members:
Clyde Lee Giles, Thesis Advisor/Co-Advisor
Dinghao Wu, Committee Member
Edward J Glantz, Committee Member
Jian Wu, Special Signatory
Mary Beth Rosson, Program Head/Chair
Keywords:
databases
migration
big data
digital libraries
NoSQL
search
Abstract:
CiteSeerX is one of the first academic digital libraries in the world and currently contains data on over 10 million academic documents. While the current technical architecture of CiteSeerX has scaled well to this point, it needs to ingest more papers and adopt modern tools to increase efficiency. This thesis examines NoSQL datastores and new ways to represent relational data in non-relational databases. We also compare the performance of Elasticsearch and MongoDB on our dataset and propose a new indexing system for CiteSeerX.