Open Access
Lu, Xiaonan
Graduate Program:
Computer Science
Doctor of Philosophy
Document Type:
Date of Defense:
August 27, 2008
Committee Members:
  • James Z Wang, Committee Chair
  • C Lee Giles, Committee Chair
  • Wang Chien Lee, Committee Member
  • Sencun Zhu, Committee Member
  • David Miller, Committee Member
Keywords:
  • metadata extraction
  • image analysis
  • document search
Abstract:
This thesis addresses two problems in document search. The first is the analysis and use of images contained in documents for document retrieval. The second is metadata generation for scanned scientific documents in web-based archives.

Images carry important non-textual information in scientific documents, yet current digital libraries do not provide users with tools to retrieve documents based on the information in those images. This thesis proposes an integrated document retrieval schema that uses both text and image information. As the initial step toward integrated document search, images are categorized into a set of pre-defined types, defined according to the functions images serve in scholarly articles. A machine-learning-based approach is proposed that categorizes images using both global features and part features extracted from image content. After categorization, algorithms analyze two common types of images in documents: 2-D plots and diagrams. An algorithm based on thin-line analysis extracts numerical data from 2-D plot images, and an integrated algorithm performs symbol recognition in diagrams. The proposed approach has been evaluated on a test-bed document set collected from the CiteSeer scientific literature digital library and other sources. Experimental evaluation shows that the algorithms produce results acceptable for real-world use.

Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The growing volume of digitized resources, including scanned books and scientific publications, requires tools and methods that efficiently analyze and manage large collections. This thesis tackles the problem of extracting metadata from scanned volumes of journals. The goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access to digital library users. Methods have been designed to automatically generate volume-level, issue-level, and article-level metadata from format and text features extracted from the scanned volumes. The automatic metadata generation software has been developed and integrated into an operational digital library, the Internet Archive, for real-world use.