Improving Prediction, Recommendation, and Classification using User-generated Content and Relationships

Open Access
Author:
Chang, Hau-wen
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
December 02, 2014
Committee Members:
  • Dongwon Lee, Dissertation Advisor
  • Wang Chien Lee, Committee Chair
  • Kamesh Madduri, Committee Member
  • Dinghao Wu, Committee Member
Keywords:
  • data mining
  • location estimation
  • recommender system
  • keyword recommendation
  • text mining
Abstract:
In the era of social network dominance, vast amounts of information are created and shared across the world each day. The uniqueness and prevalence of this user-generated content present both challenges and opportunities. In this thesis, we study several tasks in mining user-generated content, with regard to both textual content and link-based content. First, we study home location estimation for Twitter users from their shared textual content. We employ a Gaussian Mixture Model to compensate for the drawback of Maximum Likelihood Estimation. We propose two unsupervised feature selection methods, based on the notions of Non-Localness and Geometric-Localness, to prune noisy data in the content. Second, we study the item recommendation problem with a broader view of a social network system. By taking various relationships into consideration, the data sparseness problem common in recommendation tasks is alleviated. Based on the same-characteristics principle, we propose a matrix co-factorization framework with a shared latent space to optimize the recommendations collectively. Several algorithms are proposed under the framework to exploit intricate relationships in a social network system. Finally, we investigate the effectiveness of classification with imperfect textual content extracted from videos, where often very limited information is available. By means of automatic recognition techniques, some link-based content is enriched at the cost of some incorrectness. We also propose a heuristics-based method to extract n-gram keyphrases from noisy textual content by taking the importance of sub-term keywords into consideration.