Designing New Crawling and Indexing Techniques for Web Search Engines

Open Access
Tan, Qingzhao
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
October 06, 2008
Committee Members:
  • Prasenjit Mitra, Dissertation Advisor/Co-Advisor
  • Prasenjit Mitra, Committee Chair/Co-Chair
  • C. Lee Giles, Committee Chair/Co-Chair
  • Jesse Louis Barlow, Committee Member
  • Robert Collins, Committee Member
  • Dean R. Snow, Committee Member
  • digital library
  • information retrieval
  • incremental crawler
  • Search engine
The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and changes of the Web. This thesis studies in a Web search engine how a crawler with limited computing resource can effectively crawl from the dynamically changing Web and acquire the most updated Web documents, and how a Web search engine can provide information-object--oriented indexing methods which enable users to retrieve desired information with high accuracy and high efficiency. To address the first problem, existing solutions apply sampling techniques at the website level. That means, the crawler chooses a few webpages from each website as samples and then re-downloads all the webpages in the website with the largest number of changed samples. We argue that the level of a website may not be a good granularity for sampling because the update patterns of various webpages within the same website could be quite different, while webpages with similar update patterns may distribute across different websites. We design a set of sampling policies with various downloading granularities for the sampling method, taking into account the link structure, the directory structure, and the content-based features which include the clustering technique. We further extend the clustering-based sampling approach by testing more dynamic features and strategically selecting samples from each cluster. For the second problem, once the crawler has downloaded a set of documents and stored them in the search engine, a search engine should allow users to perform accurate search for desired information. As more and more digital documents containing various information objects become accessible on the Web, there is a growing demand for a Web search system to provide users with tools to retrieve documents based on these information objects. The key challenges of this problem are that, a search engine needs to (1) improve the accuracy of returned ranking, (2) enrich the format of search objects, and (3) give informative results to users in different domains. Existing search engines typically maintain large-scale inverted indexes which are built on the whole local data set. These approaches do not focus on information objects in a specific domain. Therefore, they do not meet the above requirements. There are degradations in the accuracy of the returned ranking. To fully address these issues, we propose building indexes on extracted metadata of various information objects, instead of the whole document. This greatly improves the quality of the final returned ranking. As part of this dissertation, we set up a digital library, namely ArchSeer, for the domain of archeology. Archaeologists have different search needs which cannot be provided by a general purpose search engine associated with a digital library, like Google Scholar. Therefore, the need arises for a digital library like ArchSeer, which allows users to retrieve archeology literature via domain-specific search engines. For example, archaeologists often publish maps in their documents and need to search using geo-spatial references. In this dissertation, we show how to design a digital library that performs domain-specific information extraction and indexes them to allow user enhanced search capabilities. The most significant feature of ArchSeer is that it can automatically extract metadata related to different scientific items (e.g., maps and locations) in archeology papers, and further design effective ranking algorithms and heuristics to retrieve these items in the system. This dissertation also provides solid mathematical analyses, extensive simulations and experiments to evaluate the effectiveness and show the applicability of the proposed techniques. In addition, it discusses some open issues related to the proposed solutions and suggests some interesting directions in designing efficient Web search engines.