Spectral Top-n:a Computationally Efficient Top-n Algorithm for Big Data Analytics

Open Access
Author:
Hutchison, Kenneth M
Graduate Program:
Industrial Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 15, 2015
Committee Members:
  • Soundar Kumara, Dissertation Advisor
  • Guodong Pang, Committee Member
  • Paul Griffin, Committee Member
  • Kamesh Madduri, Committee Member
  • Daniel Antion Finke, Special Member
Keywords:
  • Big Data
  • Statistics
  • Computer Science
  • Top-N
  • Recommender Systems
  • Spectral Analysis
Abstract:
Recommendation systems and their algorithms have been a rich research area of research since the first appearance of collaborative filtering as a data categorization tool. The wealth of available information is growing, and also the research striving to relate this ether of information to the individual. Today, this ether of information has taken on the visage of Big Data; where the individual plays the part of the consumer to make recommendation a seductive research area. Classically, recommendation systems have focused on analyzing relationships between user pairs or user product pairs, which makes a sensible first-cut at many Big Data problems. For instance, given a mass of historical user or purchase data, Top-N can be used to isolate common attributes among groups of users or the groups themselves given $N$ items, which provides a sane default for additional analyst intervention. Recommender analysis is not without its caveats. Foremost, without implementation of the recommender system, one must resort to models which rely on probabilistic or economic theory [and corresponding assumptions] to compare competing methodologies. In addition, by conventional wisdom,the type of recommendation system chosen will also dictate the type of data analyzed and vice versa. That is to say, in a very diverse ecology of items such as Amazon.com, a recommendation algorithm which is computationally inefficient likely will not thrive. In this train of thought, one of the more robust algorithms developed for recommendation systems has been the Top-N algorithm, in that both user and item similarity can be assessed. The Top-N criteria, though powerful, suffers from an exponential component which can make calculation untenable when item sets are large. Specifically, the calculation of similarity in a user-item system is $O(users^2*items)$ since we have to perform $users(users-1)$ calculations of a possible $items$ length. In this thesis, we combine ad-hoc calculation of the Top-N criteria with cluster computing and the relics of signal processing to produce a computationally amenable algorithm for real-time recommendation generation. [Recommender systems by design make assumptions to extrapolate user preference across a dataset of items, and the theory used to measure them also rely heavily on this assumption] In this thesis, we propose an alternative similarity metric based on spectral analysis and benchmark its computational efficacy by comparing it to traditional or ``vanilla'' Top-N Algorithms using the Dotproduct, Cosine, and Hamming measures. The method itself stems from a paradigm shift into the frequency domain: where we leverage the computational efficiency of the Fast Fourier transform in order to obtain a similarity metric suited for Big-Data or real-time analysis. We benchmark the Spectral similarity metric by: comparison to traditional algorithms, through full-factorial experimentation, Big-Data case-studies, and extensions to other areas of data mining which struggle with combinatorial complexity; in this case: the Apriori Algorithm. Our method allows for a small loss in accuracy, but significant benefits in terms of computational speed and random access memory consumption when dealing with the large binary matrices found in Top-N problems. We gain this advantage by using spectral analysis to compress and summarize the data, which decreases the computational effort and space required to compute the item or customer distance matrix.