Overcoming the bottleneck of extracting and indexing hundreds of millions of academic papers to support a scholarly big data service: a case study of CiteSeerX
Open Access
Author:
Keesara, Sai Raghav
Graduate Program:
Computer Science and Engineering
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 08, 2021
Committee Members:
C Lee Giles, Thesis Advisor/Co-Advisor; Bhuvan Urgaonkar, Committee Member; Jian Wu, Special Signatory; Chitaranjan Das, Program Head/Chair
Keywords:
Information Retrieval; Information Extraction; Digital Libraries; Search Engine; Scalability; Academic Libraries; Elasticsearch
Abstract:
CiteSeerX is one of the world’s first academic digital libraries. Established in 1998, it has given researchers access to scholarly big data across various scientific domains. On a typical day it serves about 2 million hits, with around 40-60 concurrent users. The system has not scaled well, with the SQL database being the main bottleneck for both search performance and ingestion throughput. While previous work has justified the need to migrate to a NoSQL database, the current work realizes that migration by designing and implementing an extraction and ingestion system that can ingest up to a million academic documents per day and that addresses the shortcomings of the previous architecture in scalability, modularity, and usability.