TABLESEER: AUTOMATIC TABLE EXTRACTION, SEARCH, AND UNDERSTANDING

Open Access
Author:
Liu, Ying
Graduate Program:
Information Sciences and Technology
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
August 21, 2009
Committee Members:
  • Prasenjit Mitra, Dissertation Advisor/Co-Advisor
  • C. Lee Giles, Committee Chair/Co-Chair
  • Prasenjit Mitra, Committee Chair/Co-Chair
  • Dean R. Snow, Committee Member
  • Tracy Mullen, Committee Member
Keywords:
  • Ranking
  • Search Engine
  • Table Extraction
  • Indexing
Abstract:
Tables are ubiquitous, with a history that predates that of sentential text. Authors often summarize their most important findings in tabular form. For example, scientists widely use tables to present the latest experimental results or statistical data in a condensed fashion. With the explosive growth of digital libraries and the internet, tables have become a valuable source for information seeking and data analysis. Interest in and use of table data necessitate table indexing and search. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. Searching table data effectively and efficiently has become an urgent need. In this dissertation, we present TableSeer, an automatic table extraction and search engine. TableSeer crawls the web and digital libraries, detects tables in documents using heuristic-based and machine-learning-based methods, represents tables using an extensive set of medium-independent table metadata that others can reuse, indexes table metadata files, ranks tables, and provides a user-friendly search interface. To improve the performance of table boundary detection, we propose a novel page-box-cutting method and a sparse-line detection method. Given a keyword-based table search query, TableSeer ranks the matched tables and returns the most relevant ones using a novel table ranking algorithm, TableRank. TableRank tailors the classic vector space model and adopts an innovative term weighting scheme that aggregates multiple features from three levels: the term, table, and document levels. Although tables are widely used, there is no standard for designing table structure. Many table design issues that impair the readability, accessibility, and reusability of table data are ignored. To gain a deeper understanding of table characteristics and to improve table extraction and search performance, we also conduct the first large-scale quantitative study of the nature of tables in digital libraries. We demonstrate the value of TableSeer with empirical studies on scientific documents. The experimental results show that our table search engine outperforms existing search engines on table search. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. TableSeer has been successfully deployed and is currently in use in several scientific digital libraries, for example CiteSeerX.
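
The abstract describes TableRank only at a high level: a vector space model whose term weights combine features from the term, table, and document levels. The sketch below illustrates that general idea and is not the dissertation's algorithm. Every name and number in it is an assumption made for illustration: the fields `caption`, `cells`, and `doc_score`, the caption boost of 2.0, and the choice to multiply cosine similarity by a document-level score.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def build_idf(tables):
    """Term level: inverse frequency counted over tables, not whole documents."""
    n = len(tables)
    df = Counter()
    for t in tables:
        df.update(set(tokenize(t["caption"] + " " + t["cells"])))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def table_vector(table, idf, caption_boost=2.0):
    """Table level: caption terms count more than cell terms (assumed boost)."""
    tf = Counter(tokenize(table["cells"]))
    for term in tokenize(table["caption"]):
        tf[term] += caption_boost
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tables(query, tables):
    """Return table ids, best match first.

    Document level: the similarity is scaled by a per-document score
    (hypothetical field `doc_score`, e.g. some measure of document quality).
    """
    idf = build_idf(tables)
    q = {term: idf.get(term, 0.0) for term in tokenize(query)}
    scored = [
        (cosine(q, table_vector(t, idf)) * t["doc_score"], t["id"])
        for t in tables
    ]
    return [tid for _, tid in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    tables = [
        {"id": "t1", "caption": "melting point of alloys",
         "cells": "alloy temperature 450 600", "doc_score": 1.0},
        {"id": "t2", "caption": "survey results",
         "cells": "age income melting", "doc_score": 1.0},
        {"id": "t3", "caption": "population by year",
         "cells": "1990 2000", "doc_score": 1.0},
    ]
    print(rank_tables("melting point", tables))
```

Keeping the three levels in separate functions makes it easy to swap in a different feature at one level without touching the others.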