TEMPORAL IDENTIFICATION OF USAGE PATTERNS AND OUTLIERS IN WEB SEARCHING USING TENSOR ANALYSIS

Open Access
Author:
Gopalakrishna, Chandrika
Graduate Program:
Computer Science and Engineering
Degree:
Master of Science
Document Type:
Master Thesis
Date of Defense:
None
Committee Members:
  • Bernard James Jansen, Thesis Advisor
  • Trent Ray Jaeger, Thesis Advisor
  • C Lee Giles, Thesis Advisor
Keywords:
  • Transaction log
  • Search Engine
  • Tensor
Abstract:
This research uses tensor analysis to recognize patterns and outliers in the data stream from very large search engine transaction logs. The aim is to analyze the correlation between the different attributes recorded in a search engine transaction log. From this, one can study how those attributes vary over time across a set of selected search engines and so summarize online search activity. This thesis presents a proof of concept that tensor analysis is a valid methodology for mining search engine logs to study the correlation of characteristics, identify patterns, and isolate outliers. One of the main challenges in analyzing search engine transaction logs is the huge volume of data that evolves continuously over time, a challenge that tensor analysis addresses. The experimental design consisted of two main scenarios aimed at studying trends and attribute correlation in five log files from Web search engines. The trend analysis presents the variation of a set of attributes over a period of 24 hours. The correlation analysis detects two kinds of patterns occurring over this 24-hour period: one is recognized as the normal, or main, trend, and the other as an abnormal trend that deviates from it. The results show that, in the main trend, three of the four search attributes (Search Pattern, Number of Queries, and Query Length) are positively correlated with each other and negatively correlated with the fourth attribute (User Intent). In the abnormal trend, the first and third attributes (Search Pattern and Query Length) are negatively correlated with the other two attributes. This type of analysis makes it possible to identify the outlying log entries that contribute to the occurrence of an abnormal pattern. A time window of high search engine usage within the 24-hour period was also identified.
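
The kind of attribute correlation described in the abstract can be illustrated with a minimal sketch. It assumes the log data has been aggregated into a three-mode tensor (hour x search engine x attribute) and uses a plain SVD of the attribute-mode unfolding as a stand-in for tensor analysis; this is not necessarily the decomposition used in the thesis. The helper names (mode_unfold, attribute_patterns) are hypothetical, and the data is synthetic, generated only so that the example runs.

```python
import numpy as np

def mode_unfold(tensor, mode):
    """Unfold a tensor along `mode` into a matrix with shape (tensor.shape[mode], -1)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def attribute_patterns(tensor, n_patterns=2):
    """Return the leading attribute-mode patterns and their singular values.

    `tensor` is assumed to have shape (hours, engines, attributes).
    Each column of the returned matrix holds one loading per attribute;
    loadings with the same sign move together, opposite signs move apart.
    """
    unfolded = mode_unfold(tensor, mode=2)
    # Center each attribute over all (hour, engine) cells before the SVD.
    centered = unfolded - unfolded.mean(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_patterns], s[:n_patterns]

# Synthetic data only (an assumption for illustration, not thesis data):
# 24 hours, 5 search engines, 4 attributes.
rng = np.random.default_rng(0)
hours, engines = 24, 5
attributes = ["Search Pattern", "Number of Queries", "Query Length", "User Intent"]

# A shared daily activity curve with a little per-engine noise.
activity = np.sin(np.linspace(0, 2 * np.pi, hours))[:, None]
activity = activity + 0.1 * rng.standard_normal((hours, engines))

# The first three attributes follow the activity curve; the fourth moves against it.
signs = np.array([1.0, 1.0, 1.0, -1.0])
tensor = activity[:, :, None] * signs[None, None, :]
tensor = tensor + 0.05 * rng.standard_normal((hours, engines, len(attributes)))

patterns, strengths = attribute_patterns(tensor)
main_trend = patterns[:, 0]       # dominant pattern
secondary_trend = patterns[:, 1]  # next strongest pattern, orthogonal to the first

for name, loading in zip(attributes, main_trend):
    print(f"{name}: {loading:+.2f}")
```

In this sketch the first pattern plays the role of the main trend and the second that of a deviating trend. The overall sign of a singular vector is arbitrary, so only the relative signs of the loadings within one pattern are meaningful.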