EFFECTIVE METHODS FOR WEB CRAWLING AND WEB INFORMATION EXTRACTION
Open Access
Author:
Zheng, Shuyi
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
April 11, 2011
Committee Members:
C Lee Giles, Dissertation Advisor/Co-Advisor; C Lee Giles, Committee Chair/Co-Chair; Jesse Louis Barlow, Committee Member; Daniel Kifer, Committee Member; Murali Haran, Committee Member; Raj Acharya, Committee Member
Keywords:
Wrapper Induction; Information Extraction; Web Crawling; Machine Learning
Abstract:
Crawling and information extraction are two fundamental components of almost all web-scale search engines. They are usually the first two steps of a search engine's system pipeline. First, a web crawling system (a.k.a. "crawler", "robot", or "spider") traverses the web in a certain manner and provides the raw content (crawled webpages) for search engines. Then, an extraction system is used to interpret those crawled pages correctly before they can be indexed and presented to the end user.
This research comprises several works, carried out in the context of real systems, that address fundamental issues often encountered in web crawling and web information extraction: (1) how to design and manage a large-scale web crawler; (2) how to select seeds for a web-scale crawler; (3) how to order URLs to effectively obtain relevant documents; (4) how to extract information from news articles and homepages that do not conform to any template; (5) how to combine template detection and wrapper generation for template-dependent information extraction; (6) how to reduce the labeling cost for wrapper induction.