A COMPREHENSIVE STUDY OF THE REGULATION AND BEHAVIOR OF WEB CRAWLERS
Open Access
- Author:
- Sun, Yang
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- August 25, 2008
- Committee Members:
- C Lee Giles, Committee Chair/Co-Chair
- James Z Wang, Committee Member
- Prasenjit Mitra, Committee Member
- Runze Li, Committee Member
- Keywords:
- botseer
- ethics
- bias
- web crawler
- robots.txt
- Abstract:
- Search engines and many web applications, such as online marketing agents, intelligent shopping agents, and web data mining agents, rely on web crawlers to collect information from the web, which has led to an enormous amount of web traffic generated by crawlers alone. Due to the unregulated, open-access nature of the web, crawler activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Ethical crawlers (and many commercial ones) follow the rules specified in robots.txt files. Since the Robots Exclusion Protocol has become a de facto standard for crawler regulation, a thorough study of the regulation and behavior of crawlers with respect to the protocol allows us to understand the impact of search engines and the current state of privacy and security issues related to web crawlers.

The Robots Exclusion Protocol allows websites to explicitly specify an access preference for each crawler by name. Such biases may lead to a "rich get richer" situation, in which a few popular search engines ultimately dominate the web because they have preferred access to resources that are inaccessible to others. We propose a metric to evaluate the degree of bias to which specific crawlers are subjected. We investigated 7,593 websites covering the education, government, news, and business domains and collected 2,925 distinct robots.txt files. Content and statistical analysis of the data confirms that the crawlers of popular search engines and information portals, such as Google, Yahoo, and MSN, are generally favored by most of the websites we sampled. The biases toward popular search engines are verified by applying the bias metric to 4.6 million robots.txt files from the web. These results also show a strong correlation between search engine market share and the bias toward particular search engine crawlers.

Since the Robots Exclusion Protocol is only an advisory standard, actual crawler behavior may differ from the regulation rules; that is, crawlers may ignore robots.txt files or violate some of the rules they contain. A thorough analysis of web access logs reveals many potential ethical and privacy issues in crawler-generated visits. We present log analysis results for three large-scale websites and applications of the extracted data, including estimates of the crawler population and measures of user stability. To minimize the negative effects of crawler-generated visits on websites, this thesis studies the ethical issues of crawler behavior with respect to the crawling rules specified by websites. As many website administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue of computer ethics. We analyze the behavior of web crawlers in a crawler honeypot, a set of websites in which each site is configured with a distinct regulation specification using the Robots Exclusion Protocol in order to capture specific behaviors of web crawlers. A set of ethicality models is proposed to measure the ethicality of web crawlers computationally, based on their conformance to the regulation rules. The results show that ethicality scores vary significantly among crawlers.
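As a concrete illustration of the per-crawler rules that the Robots Exclusion Protocol supports, and of the kind of conformance check such ethicality measurements rest on, here is a minimal sketch that parses a small robots.txt with Python's standard urllib.robotparser; the crawler names and paths are hypothetical and are not drawn from the thesis data.

```python
import urllib.robotparser

# Hypothetical robots.txt that favors one named crawler over all others.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named crawler may fetch the restricted path; any other crawler may not.
print(parser.can_fetch("Googlebot", "/private/page.html"))     # True
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # False
```

A crawler that respects the protocol performs a check of this kind before each request; whether real crawlers behave that way in practice is what the honeypot experiments observe.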
Most commercial web crawlers receive good ethicality scores; however, many commercial crawlers still consistently violate certain robots.txt rules. The bias and ethicality measurements calculated with our proposed metrics are important resources for webmasters and policy makers when designing websites and policies. We design and develop BotSeer, a web-based robots.txt and crawler search engine that makes these resources available to users. BotSeer currently indexes and analyzes 4.6 million robots.txt files obtained from 17 million websites, as well as three large web server logs, and provides search services and statistics on web crawlers for researching crawler behavior and trends in Robots Exclusion Protocol deployment and adherence. BotSeer serves as a resource for studying the regulation and behavior of web crawlers, as well as a tool to inform the creation of effective robots.txt files and crawler implementations.
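The abstract does not spell out the exact form of the ethicality models; as a rough, hypothetical illustration of a conformance-based score, one could take the fraction of a crawler's logged requests that the site's robots.txt actually permits. The function, crawler name, and log entries below are illustrative assumptions, not the thesis's definitions.

```python
import urllib.robotparser

def conformance_score(robots_lines, user_agent, requested_paths):
    """Fraction of a crawler's logged requests that robots.txt permits
    (1.0 = fully conformant, 0.0 = every request violated a rule)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    if not requested_paths:
        return 1.0
    allowed = sum(parser.can_fetch(user_agent, path) for path in requested_paths)
    return allowed / len(requested_paths)

# Hypothetical access log: the crawler fetched one disallowed path out of three.
robots = ["User-agent: *", "Disallow: /private/"]
requests = ["/index.html", "/private/data.html", "/about.html"]
print(conformance_score(robots, "ExampleBot", requests))  # ~0.67
```

A per-rule or severity-weighted variant would be needed to reproduce the thesis's measurements; this sketch only conveys the idea of scoring conformance against robots.txt rules.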