Mining User-generated Contents on the Web and Social Networks

Open Access
- Author:
- Huang, Shu
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 12, 2013
- Committee Members:
- Dongwon Lee, Dissertation Advisor/Co-Advisor
- Peng Liu, Committee Member
- Heng Xu, Committee Member
- Jack C Hayya, Committee Member
- Keywords:
- data mining
- social network
- information retrieval
- Abstract:
- In solving diverse data management problems, the underlying social network between users and the semantics hidden deep in User-generated Contents (UGC) can be useful from many perspectives. Finding and applying such hidden semantics of UGC and social correlations offers a new way of solving various problems. In this thesis, we study several challenging data management problems to investigate how to apply the framework of UGC mining and social network analysis to substantially improve existing solutions. In particular, we focus on the following four problems. First, we propose a novel query expansion technique in Information Retrieval that exploits the "location-based" correlation between users and search engine user logs. We explore the vocabulary of users from different geographic locations and investigate the semantic relations among the documents they search for. Based on that, a hierarchical location- and topic-based query expansion model is proposed to improve the accuracy of web search. Our proposed model predicts the query location sensitivity with more than 80% precision. Using the model, the final search results are significantly better than those of several existing query expansion methods. Second, we explore aggregate social activity and evaluate the significance of various activity features in determining how social activity evolves. In particular, we look into various forms of social activity and measure how member activity impacts the evolution of the active population. Several activity features are extracted and their impact on community evolution is evaluated with a feature selection model. Based on the model, the most significant features are identified. Third, we study UGC on Twitter, a large online social media platform, to identify tweet topics and sentiments towards some preset brands/products.
To help understand brand perception and customer opinions, we utilize the correlation between tweet sentiments and topics and propose a multi-task multi-label (MTML) classification model that classifies both sentiments and topics simultaneously. It incorporates the results of each task from prior steps to promote and reinforce the other iteratively. Meanwhile, by using multiple labels, class ambiguity can be addressed. Compared with baselines, MTML achieves much higher accuracy in both sentiment and topic classification. Furthermore, building on tweet sentiment analysis, we also take the social network among Twitter users into consideration to investigate the impact of events on changes in tweet sentiment. By mining tweets about the 2012 USA presidential campaign, we analyze the sentiments towards the presidential candidates. Meanwhile, we incorporate social correlation between Twitter users and present a method to predict the impact of events based on social activities. Analysis of tweets collected over 8 months shows that our method can predict sentiment change with high accuracy. Mining UGC and social networks is not only efficient but also effective in predicting the impact of events.