AlgSeer: An Architecture For Extraction, Indexing And
Search Of Algorithms In Scientific Literature
Open Access
Author:
Carman, Stephen H
Graduate Program:
Information Sciences and Technology
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
May 03, 2013
Committee Members:
C Lee Giles, Thesis Advisor/Co-Advisor
Dinghao Wu, Thesis Advisor/Co-Advisor
John Yen, Thesis Advisor/Co-Advisor
Keywords:
search, search engine, computer science
Abstract:
Algorithms are ubiquitous in the computer science literature. It is very rare to see a publication
in computer science that does not introduce or cite algorithms of some kind or another. It is
therefore necessary to extract and index algorithms for search and retrieval. In this thesis we
present AlgSeer, a complete architecture for extracting, indexing, and searching for algorithms.
We present related work in the areas of citation analysis, document element extraction and
speciality search engines that share a similar goal with AlgSeer. We present a complete description
of all the pieces that make up the architecture of AlgSeer and we provide in-depth analysis and
testing of each function of the system. We extract algorithms from the CiteSeerX repository of
two million documents, then index the extracted data and stress test that index. From
this data, we extract over 180 thousand algorithms in an XML format totaling 8.3 GB. We also
provide an analysis of the search showing that our index scales far beyond any realistic
prediction of the traffic the system would encounter in any practical scenario: under
a 600 QPM stress test on the index, the query response time never exceeds 5 ms.
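
To make the stress-test claim concrete, the following is a minimal sketch of pacing queries against an index and recording the worst response time. It is not the AlgSeer implementation: the in-memory inverted index stands in for whatever index the thesis uses, QPM is assumed to mean queries per minute, and the names build_index, search, and stress_test are hypothetical.

```python
import time
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def stress_test(index, queries, qpm=600, total=60, sleep=time.sleep):
    """Issue `total` queries paced at `qpm`; return the worst latency in ms.

    Assumption: QPM means queries per minute, so 600 QPM is one query
    every 0.1 s. `sleep` is injectable so the pacing can be skipped in tests.
    """
    interval = 60.0 / qpm
    worst_ms = 0.0
    for i in range(total):
        start = time.perf_counter()
        search(index, queries[i % len(queries)])
        elapsed = time.perf_counter() - start
        worst_ms = max(worst_ms, elapsed * 1000.0)
        # Wait out the remainder of the slot to hold the target rate.
        sleep(max(0.0, interval - elapsed))
    return worst_ms


# Hypothetical toy corpus, for illustration only.
docs = {
    "a1": "Quick sort partition algorithm",
    "a2": "Merge sort algorithm",
}
index = build_index(docs)
```

Calling stress_test(index, ["sort algorithm", "quick sort"]) would run for about six seconds at the assumed rate and return the slowest observed query time in milliseconds, which can then be compared against a budget such as the 5 ms figure reported above.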