EVENT DETECTION AND PREDICTION USING ONLINE USER GENERATED DATA

Open Access
- Author:
- Lim, Sunghoon
- Graduate Program:
- Industrial Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 26, 2018
- Committee Members:
- Conrad S Tucker, Dissertation Advisor/Co-Advisor
Conrad S Tucker, Committee Chair/Co-Chair
Soundar Kumara, Committee Member
Ling Rothrock, Committee Member
Nilam Ram, Outside Member
- Keywords:
- Machine Learning
Text Mining
Social Media Analytics
Online Data Analytics
Event Detection
Event Prediction
- Abstract:
- Data-driven event detection and prediction are fundamental research challenges of the 21st century. They provide valuable current and future knowledge not only to large organizations, such as enterprises and hospitals, but also to individuals, such as their customers and patients. In particular, textual data have been widely used as a primary knowledge source for data-driven event detection and prediction, since 80 percent of the digital data generated by society today originate in unstructured textual form. Unfortunately, existing studies on text-data-driven event detection and prediction typically employ top-down machine learning methods, which are constrained by their need for (1) datasets of examples for training the models or (2) predetermined search keywords. Generating datasets of examples is often expensive and impractical for real-world applications. Furthermore, it is difficult to use predetermined indicators to detect and predict new events. The objective of this dissertation is to create bottom-up machine learning models, which do not require datasets of examples for training the models or predetermined search keywords, for text-data-driven event detection and prediction. The bottom-up machine learning models reduce type I and type II identification errors during the process of text-data-driven event detection and prediction. Reducing type I identification errors (i.e., false positives) is crucial for decreasing the misidentification of irrelevant data as relevant data, as such errors reduce the quality of the data needed for event detection and prediction. Reducing type II identification errors (i.e., false negatives) is also important, because increasing the amount of correctly identified data increases the quantity of data available for event detection and prediction.
In particular, this research uses online user generated data, such as social media data, as a knowledge source for event detection and prediction due to (1) the availability of user opinions related to a wide range of topics (from a user’s perspective); (2) the ability to acquire user feedback in real time and at low cost (from an analyst’s perspective); and (3) the size and heterogeneity of the data. In this dissertation, first, a Bayesian sampling model is presented for determining appropriate search keywords that reduce type I and type II identification errors when detecting events, such as detecting users’ feedback on product features or users’ medical conditions. Second, a clustering-based model using sentiment analysis is proposed for detecting the spread of events, such as detecting the spread of positive/negative online user feedback or the spread of latent diseases. Third, a causal analysis model based on word co-occurrence networks and Granger causality analysis is provided for event prediction, such as predicting the spread of positive/negative user generated content or future enterprise outcomes. The bottom-up machine learning models presented in this dissertation can be applied to event detection and prediction using online user generated data in a wide range of fields.