AlgSeer: An Architecture For Extraction, Indexing And Search Of Algorithms In Scientific Literature

Open Access
Carman, Stephen H
Graduate Program:
Information Sciences and Technology
Master of Science
Document Type:
Master Thesis
Date of Defense:
May 03, 2013
Committee Members:
  • C Lee Giles, Thesis Advisor
  • Dinghao Wu, Thesis Advisor
  • John Yen, Thesis Advisor
Keywords:
  • search
  • search engine
  • computer science
Abstract:
Algorithms are ubiquitous in the computer science literature. It is very rare to see a publication in computer science that does not introduce or cite an algorithm of some kind. It is therefore necessary to extract and index algorithms for search and retrieval. In this thesis we present AlgSeer, a complete architecture for extracting, indexing, and searching for algorithms. We review related work in citation analysis, document element extraction, and specialty search engines that share similar goals with AlgSeer. We give a complete description of the components that make up the AlgSeer architecture, and we provide in-depth analysis and testing of each function of the system. We extract algorithms from the CiteSeerX repository of two million documents, index them, and stress test the resulting index. From this data, we extract over 180 thousand algorithms in an XML format totaling 8.3 GB. We also provide an analysis of search performance showing that our index scales far beyond any realistic prediction of the traffic the system would encounter in practice: under a 600 QPM stress test on the index, the query response time never exceeds 5 ms.
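To put the stress-test figures in perspective, the short calculation below works through the arithmetic. It is only a back-of-the-envelope sketch, not a description of the thesis's test harness. It assumes that QPM means queries per minute, that queries arrive evenly spaced, and that a single worker answers them one at a time; the function name `busy_fraction` is hypothetical.

```python
def busy_fraction(queries_per_minute: float, max_latency_ms: float) -> float:
    """Upper bound on the share of time one serial worker spends answering queries.

    Assumes evenly spaced arrivals (an illustrative simplification).
    """
    interarrival_ms = 60_000 / queries_per_minute  # gap between consecutive queries
    return max_latency_ms / interarrival_ms


# Figures quoted in the abstract.
qpm = 600            # stress-test load, read here as queries per minute
max_latency_ms = 5   # reported worst-case response time

interarrival_ms = 60_000 / qpm              # one query every 100 ms
busy = busy_fraction(qpm, max_latency_ms)   # at most 5% of the time is spent serving

print(f"one query every {interarrival_ms:.0f} ms, worker busy at most {busy:.0%}")
```

Under these assumptions, a query arrives every 100 ms and is answered within 5 ms, so the index would sit idle for at least 95% of the test. That is consistent with the claim that the load is well within the system's capacity.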