Mining Texts and Social Users Using Time Series and Latent Topics

Open Access
- Author:
- Yang, Tao
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- December 10, 2013
- Committee Members:
- Dongwon Lee, Dissertation Advisor/Co-Advisor
- Xiaolong Zhang, Committee Member
- Prasenjit Mitra, Committee Member
- Bruce G Lindsay, Committee Member
- Keywords:
- text mining
- social user mining
- time series
- topic models
- Abstract:
- Knowledge discovery has attracted tremendous interest and developed rapidly in both text mining and social user mining. Its main purpose is to search massive volumes of data for patterns, the so-called knowledge. Knowledge can exist in different formats, such as text or numbers. It can be observed or hidden in different hierarchies. It can even be user-generated, such as the social content and social activity of the Web 2.0 era. In this dissertation, we study a series of new knowledge discovery techniques through four data mining applications. First, we propose a novel framework for mining text databases using time series, bridging two seemingly unrelated domains: alphabetic strings and numerical signals. We study how various transformation methods affect the accuracy and performance of detecting near-duplicate texts in record linkage. Second, we develop new topic models for mining text documents using latent topics, to tackle the noisy data problem in document modeling. We show how incorporating textual errors and topic dependency into the generative process affects the generalization performance of topic models. Third, we introduce novel methods for mining social content using time series to classify user interests. We show the accuracy of our approach in both binary and multi-class classification of the sports and political interests of social users. Finally, we introduce a generative modeling approach for mining social activity using latent topics to predict user attributes. We show the performance of our methods in predicting binary and multi-class demographic attributes of social users.