Discrete Distribution Clustering in Big Data and a Method for Storm Prediction Leveraging Large Historical Archives
Open Access
- Author:
- Zhang, Yu
- Graduate Program:
- Information Sciences and Technology
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- October 14, 2014
- Committee Members:
- James Z Wang, Dissertation Advisor/Co-Advisor
- Jia Li, Dissertation Advisor/Co-Advisor
- James Z Wang, Committee Chair/Co-Chair
- Jia Li, Committee Member
- C Lee Giles, Committee Member
- Anna Cinzia Squicciarini, Committee Member
- Chris Eliot Forest, Committee Member
- Keywords:
- discrete distribution
- clustering
- parallel computing
- image annotation
- protein clustering
- storm prediction
- Abstract:
- Big data brings new challenges and opportunities to many scientific areas today. Characterized by high volume, velocity, and variety (the 3Vs model), big data is valuable in many knowledge discovery applications, but it requires new methodologies and technologies to manage and make use of the data. In this dissertation, a fundamental methodology and an emerging application of big data are presented. First, the parallel discrete distribution (PD2) clustering algorithm is designed and implemented. Discrete distributions are widely adopted data signatures in information retrieval and machine learning, and discrete distribution (D2) clustering is a fundamental methodology. However, the high computational complexity of D2-clustering limits its impact on massive learning problems. PD2-clustering, with its substantially improved scalability, facilitates unsupervised learning in many big data applications. Extensive analysis and experiments are presented to demonstrate the effectiveness and advantages of PD2-clustering. Second, satellite image analysis for storm forecasting is explored as an application of big data in meteorology. Large archives of historical satellite images and storm reports are mined to predict storms. The proposed algorithm extracts visual storm signatures from satellite image sequences in a way similar to how meteorologists interpret them, and incorporates past meteorological records to model and classify the signatures. This big-data-driven approach aims to overcome the intrinsic numerical instability of conventional weather forecasting based on physical numerical models, and it serves as a new component in a weather forecasting system. Experimental results in both studies show the benefits of leveraging big data in multiple areas.