Identifying Product Web Pages Using Support Vector Machines

Open Access
Gowda Aghalya Shyama Sundar, Deepika
Graduate Program:
Computer Science and Engineering
Master of Science
Document Type:
Master Thesis
Date of Defense:
March 25, 2010
Committee Members:
  • Prasenjit Mitra, Thesis Advisor
Keywords:
  • Machine Learning
Abstract:
Comparative online shopping tools allow users to compare similar products from different vendors. Despite the multitude of online retail websites, effective comparative search tools for consumers are lacking. Currently, consumers who want to compare similar products from different retail websites must search each website individually. Effective algorithms that can extract accurate product (as opposed to non-product) information from different vendors and present it on a comparative basis have the potential to significantly reduce online shopping times. As a first step towards building such a comparative tool (for any product category), product web pages need to be identified. The objective of this research is to develop and test algorithms to identify product web pages among a collection of product and non-product web pages.
A typical web page can be identified by a Uniform Resource Locator (URL) and contains text (user interface data) and Hyper Text Markup Language (HTML) code (user-hidden data, which includes title tags, anchor tags, head tags and body tags) that can be used to classify web pages. The first algorithm uses URLs to identify product web pages. The second algorithm proposes and tests three methods of screening HTML information to create feature sets as input data to the Support Vector Machine (SVM) algorithm. Each feature set generated by the three methods is given as input to the SVM, and the classification accuracy is determined. The method that yields the highest classification accuracy is taken as the best HTML screening method for creating the feature set. The data set for the first algorithm consisted of seventy-six URLs from product and non-product web pages of a commercial computer vendor. The data set for the second algorithm consisted of one hundred product and non-product web pages each from four commercial vendors.
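The URL-based idea described above can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the keyword list, the tokenization rule and the function names are all hypothetical, chosen only to show how keywords in a URL might flag a product page.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the abstract does not state which keywords were used.
PRODUCT_KEYWORDS = {"product", "productdetail", "item", "sku"}


def url_tokens(url):
    """Split the path and query string of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    text = f"{parsed.path} {parsed.query}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


def looks_like_product_url(url, keywords=PRODUCT_KEYWORDS):
    """Return True if any URL token matches one of the (assumed) product keywords."""
    return any(tok in keywords for tok in url_tokens(url))
```

A rule of this kind can only work when the vendor's URLs actually carry informative keywords; a URL made up of opaque identifiers would give it nothing to match.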
The experimental results using the first algorithm to identify certain web pages are promising, provided the URLs contain informative keywords. Using the feature set generated by method 3, the SVM-based algorithm achieved a classification accuracy of 93% and also shortened the learning phase of the SVM algorithm. The thesis presents the experimental results in detail and discusses the advantages and limitations of the developed algorithms.
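As an illustration of what screening HTML to build a feature set could look like, the Python sketch below keeps only the text inside a chosen set of tags and turns it into word counts. The abstract does not describe the three screening methods, so the choice of tags (`title` and `a`), the class name `TagTextCollector` and the function `html_features` are assumptions made for this example only.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text that appears inside a chosen set of tags (assumed screening rule)."""

    def __init__(self, keep_tags=("title", "a")):
        super().__init__()
        self.keep_tags = set(keep_tags)
        self.depth = 0    # number of kept tags currently open
        self.chunks = []  # text fragments found inside kept tags

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.keep_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def html_features(html, keep_tags=("title", "a")):
    """Return word counts for the text inside the kept tags of an HTML page."""
    collector = TagTextCollector(keep_tags)
    collector.feed(html)
    collector.close()
    words = re.findall(r"[a-z0-9]+", " ".join(collector.chunks).lower())
    return Counter(words)
```

Word counts like these would then be mapped onto a fixed vocabulary to form the numeric vectors an SVM takes as input. Restricting the kept tags also reduces the number of features, which is one plausible way a screening method could shorten training.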