Scalable and robust algorithms for clustering large scale networks

Open Access
Author:
Mandala, Supreet Reddy
Graduate Program:
Industrial Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
August 17, 2012
Committee Members:
  • Soundar Kumara, Dissertation Advisor
  • Tao Yao, Committee Member
  • Arunachalam Ravindran, Committee Member
  • Kalyan Chatterjee, Committee Member
Keywords:
  • Social Networks
  • Clustering
  • Optimization
  • Data Mining
Abstract:
Several social and technological systems around us can be modeled as a network or a graph. The topology of such networks is known to play a major role in understanding and predicting the system behavior. A topological feature which has been observed across several domains is the formation of clusters. A cluster or module or community is a subset of vertices which have high density of con- nections between cluster members and few connections leaving the cluster. The knowledge of clusters is extremely useful in a variety of domains - product rec- ommendations in social networks, characterizing functions on unknown proteins in protein interaction networks, product placement and promotions in retails net- works. In this thesis, we focus four problems addressing different aspects of graph clustering. Firstly, we propose an algorithm which can detect disjoint clusters in graphs. We optimize an existing quality function, called modularity, which quantifies the quality of a given partition. Therefore, detecting clusters is equiva- lent to finding a partition which maximizes modularity. Due to the combinatorial nature of the modularity maximization it is hard to find optimal solutions. We propose an ant colony optimization based heuristic which can detect near optimal partitions in a reasonable amount of time. We validate the effectiveness of our algorithm on various social network data sets. Our second problem addresses the issue of detecting robust graph partitions. The motivation behind this problem is to detect partitions which not only maximize modularity but are also immune perturbations in network topology. In several real world setting, perturbations in network topology can be expected due to noisy experimental data. We develop a robust optimization based max-min optimization model in order to define a ro- bust graph partition. Furthermore, we propose an algorithm which can detect a robust partition while maintaining scalability. The usefulness of robust partitions in illustrated on both real and artificial networks. The third problem deals with detecting multiple diverse alternative partitions with high modularity scores. It iii has already been reported that several dissimilar partitions may exist with quite similar modularity scores. We propose an approach to detect alternative graph partitions by modifying the network topology forcing clustering algorithm to de- tect diverse partitions. We confirm the existence of diverse partitions in several real world networks. A case study is presented for a retail graph (book co-purchase network) at amazon.com illustrating the utility of our approach. As a part of the fourth problem, we propose a first principles definition of a cluster which does not have any limitations of modularity based approaches. Moreover, our approach is able to detect overlapping clusters. We show that our definition of a cluster has a direct correspondence with the nash equilibrium of a non-cooperative game. We define a clustering game and show the existence of a nash equilibrium. Both parallel and non-parallel algorithms are proposed which have a near linear time running time. We compare our approach to existing overlapping cluster detec- tion techniques when tested on artificial networks. Also, potential applications are shown in natural language processing (word sense disambiguation) and WWW segmentation problems.