Social-enriched Data Analysis and Processing Tools
Open Access
- Author:
- Liu, Xingjie
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 28, 2013
- Committee Members:
- Wang Chien Lee, Dissertation Advisor/Co-Advisor
Wang Chien Lee, Committee Chair/Co-Chair
Piotr Berman, Committee Member
Daniel Kifer, Committee Member
Jia Li, Committee Member
- Keywords:
- Social Network Analysis
Recommendation System
Distributed System
- Abstract:
- In recent years, the rapid development of online social services, such as Facebook, Twitter, LinkedIn and Foursquare, has posed new opportunities and challenges for researchers. On the one hand, with huge amounts of comprehensive social network data and various types of user-generated content made available for analysis, we are able to conduct in-depth studies at a scale that was never possible before. The data will help us better understand people's opinions and activities, capture trends in our society and improve social services. On the other hand, such data require novel techniques for modeling, extraction and processing to reveal their real value, because many existing solutions cannot handle new issues such as heterogeneous data types and scalability and efficiency requirements. In this thesis, we introduce the concept of social-enriched data, defined as social connection graphs together with the user-created content distributed over those graphs, to represent the data collected in the aforementioned online social services. We identified several issues in handling social-enriched data and proposed a set of novel solutions to tackle them. First, as people tend to interact with others both through online services and in their offline lives, capturing the properties of heterogeneous social networks with both online and offline components becomes critical. Hence, we investigated a new type of social network, Event-based Social Networks (EBSNs), as a typical example of a heterogeneous graph. EBSNs contain both online social interactions, as in conventional online social networks, and offline social interactions captured in offline activities. Based on real data collected from Meetup, a social event organizing service, we analyzed EBSN properties and discovered many unique and interesting characteristics, such as heavy-tailed degree distributions and strong locality of social interactions.
In addition, we studied the effect of the heterogeneous nature of EBSNs (the co-existence of online and offline social interactions) on two challenging problems: community detection and information flow. We found that communities detected in EBSNs are more cohesive than those in other types of social networks (e.g., location-based social networks). In the context of information flow, we studied the event recommendation problem and significantly improved recommendation quality with a community-based diffusion model that combines online and offline interactions. Second, as user-created content constitutes an essential ingredient of many online social services, we chose to study it in a widely applied setting, namely recommendation. In particular, we focused on the problem of recommending content to a group of users by utilizing the social context. To extract group preference information from social-enriched data, we analyzed the decision-making process in user groups and proposed a personal impact topic (PIT) model, a type of probabilistic generative model. The PIT model effectively identifies the group preference profile for a given group by mining the individual preferences and personal impacts of group members from the group recommendation history. Further, we integrated friend connection information to obtain an extended personal impact topic (E-PIT) model. Through comprehensive data analysis and evaluations conducted on three real datasets, we demonstrated that the social-based PIT and E-PIT approaches achieve good performance. Finally, to support efficient data analysis and combat scalability issues, we proposed two data analysis tools for social-enriched data, namely, distributed graph summarization and the uncertain skyline query. The distributed graph summarization algorithms summarize a large-scale graph into an abstract graph in which the topology of the original graph is preserved.
As online social networks can become extremely large and complex, graph summarization is crucial in uncovering useful insights about the patterns hidden in the underlying graphs. In our study, we introduce three distributed algorithms that enable parallel graph summarization, produce good-quality summaries, and scale well with increasing data sizes. The uncertain skyline operator is a data filtering operator that identifies the set of data items not dominated by any other item, where each item is represented as a multidimensional data tuple with probabilistic attribute values. The operator is particularly useful for multi-criteria analysis and filtering of user-created content. Specifically, the U-Skyline query searches for the set of tuples that has the highest probability (aggregated over all possible scenarios) of being the skyline answer. In order to answer U-Skyline queries efficiently, we propose a series of optimization techniques for query processing. Our performance evaluation shows that our algorithm is 10 to 100 times faster than state-of-the-art solutions. Social-enriched data analysis is attracting growing research interest. This thesis presents pioneering work on several challenging topics in this area, and we believe that our solutions will provide real value to the utilization of social-enriched data in practice.
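To make the notion of dominance behind the skyline operator concrete, the following is a minimal sketch of a conventional (deterministic) skyline computation. It is not the U-Skyline algorithm of the thesis: it assumes exact attribute values rather than probabilistic ones, assumes that smaller values are preferred in every dimension, and uses hypothetical names (`dominates`, `skyline`) and made-up sample data chosen only for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in every dimension and
    strictly better in at least one (assumption: smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(items: List[Point]) -> List[Point]:
    """Return the items that no other item dominates.

    This is a naive quadratic scan over exact values; it does not model
    the probabilistic attributes handled by an uncertain skyline query.
    """
    return [p for p in items if not any(dominates(q, p) for q in items)]


# Illustrative (price, distance) tuples; the values are invented.
items = [(50, 8), (80, 2), (90, 5), (60, 9), (120, 1)]
print(skyline(items))
```

In this sample, (90, 5) is dominated by (80, 2) and (60, 9) is dominated by (50, 8), so neither appears in the result. With uncertain data, each tuple's attributes are probabilistic, so whether a tuple belongs to the skyline varies across possible scenarios, which is why the U-Skyline query described above aggregates probabilities over them.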