DESIGN AND IMPLEMENTATION OF A MULTI-STAGE PIPELINE FOR LARGE SCALE EXTRACTING, CLUSTERING AND INGESTION OF ACADEMIC DOCUMENTS FOR CITESEERX
Restricted (Penn State Only)
Author:
Angadi, Manoj Kumar
Graduate Program:
Computer Science and Engineering
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 23, 2023
Committee Members:
Chitaranjan Das, Program Head/Chair
C Lee Giles, Thesis Advisor/Co-Advisor
Bhuvan Urgaonkar, Committee Member
Keywords:
CiteSeer, Extraction, Clustering, Ingestion, LSH, BM25, NGX, Elasticsearch, Python, Pipeline, EIS, Citation index, PDF, Document Server
Abstract:
CiteSeer is the world’s first digital library search engine (DLSE). It was originally developed in 1997 to support large-scale data collection and search of computer science documents for millions of users worldwide. Its objective is to improve the dissemination of scientific literature and to enhance the accessibility, comprehensiveness, usability, timeliness, efficiency, and cost-effectiveness of scientific and academic knowledge.
To address growing complexity, CiteSeerX replaced the original CiteSeer design, but it still faced some issues. To meet the needs of the research community and overcome the challenges encountered by the original system, a new architecture and data model were developed for CiteSeerX, referred to as Next Generation CiteSeerX (NGX). These updates were crucial for sustaining the CiteSeer legacy.
The number of scholarly documents published every year has been growing steadily, leading to many challenges. We discuss in detail the shortcomings of the previous work and how the current work addresses them. The modified architecture proposed in this document aims to address the shortcomings of the NGX architecture by introducing a scalable multi-stage pipeline architecture for extracting, clustering, and ingesting academic documents.