Information Extraction and Metadata Annotation for Algorithms in Digital Libraries

Open Access
- Author:
- Tuarob, Suppawong
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 08, 2014
- Committee Members:
- C Lee Giles, Dissertation Advisor/Co-Advisor
Prasenjit Mitra, Dissertation Advisor/Co-Advisor
Raj Acharya, Committee Chair/Co-Chair
Robert Collins, Committee Member
Wang Chien Lee, Committee Member
Conrad S Tucker, Committee Member
Lee David Coraor, Special Member - Keywords:
- Algorithms
Search Engine
Machine Learning
Classification - Abstract:
- Algorithms are developed in computer science and related disciplines to provide concise step-by-step instructions for solving particular problems. For example, Dijkstra's algorithm was developed by computer scientist Edsger Dijkstra in 1956 to find a shortest path in a graph with a single source node. As science advances, problems in other disciplines are often transformed into algorithmic ones so that standard algorithms can be applied. For example, algorithms for stock portfolio optimization are used for diversifying search results in information retrieval systems. Though algorithm collections such as encyclopedias (e.g. Wikipedia) and algorithm textbooks manually catalog algorithms and make them available, such collections only contain standard algorithms (algorithms which are well defined and well known). Examples of standard algorithms include Dijkstra's algorithm, Binary search, Fast Fourier transform, etc. Despite the well-established nature of these standard algorithms, they hardly satisfy the needs of algorithm users and researchers who seek cutting-edge solutions to their problems. Fortunately, researchers constantly develop new algorithms to solve new problems that have not been solved before, or new algorithms that improve upon the existing ones. A significant number of scholarly articles in computer science and related disciplines hence contain high-quality algorithms developed by researchers. However, manually searching for algorithms in a pile of papers would be tedious. Therefore, it would be useful to have a system that automatically collects and enables the search for this ever increasing collection of algorithms. Despite the development of various search engine systems for entities in scholarly documents such as tables, figures, mathematical expressions, etc., systems that collect and index algorithms from scholarly documents never existed. The difficulty of automatically identifying algorithm representations and extracting their metadata make this problem challenging. While the needs for algorithm search engines are apparent, literature on the possibility to build such systems is still in the infant stage. This dissertation discusses AlgorithmSeer, a search engine for algorithms. AlgorithmSeer collects algorithms from scholarly documents in the forms of pseudo-codes and algorithmic procedures. It then extracts and indexes the algorithm metadata, allowing algorithms to be searched. We raise the issue of metadata annotation, which can be solved by transferring the annotation from the well-annotated corpus via topical modeling. Finally, we discuss the potential use of algorithm citations in order to enhance the search capacity beyond traditional text based search.