Discrete Distribution Clustering in Big Data and a Method for Storm Prediction Leveraging Large Historical Archives

Open Access
Author:
Zhang, Yu
Graduate Program:
Information Sciences and Technology
Degree:
Doctor of Philosophy
Document Type:
Date of Defense:
October 14, 2014
Committee Members:
  • James Z Wang, Dissertation Advisor/Co-Advisor
  • Jia Li, Dissertation Advisor/Co-Advisor
  • James Z Wang, Committee Chair/Co-Chair
  • Jia Li, Committee Member
  • C Lee Giles, Committee Member
  • Anna Cinzia Squicciarini, Committee Member
  • Chris Eliot Forest, Committee Member
Keywords:
  • discrete distribution
  • clustering
  • parallel computing
  • image annotation
  • protein clustering
  • storm prediction
Abstract:
Big data brings new challenges and opportunities to many scientific areas today. Characterized by high volume, velocity, and variety (the 3Vs model), big data is valuable in many knowledge discovery applications, but it requires new methodologies and technologies to manage and make use of the data. In this dissertation, a fundamental methodology and an emerging application of big data are presented. First, the parallel discrete distribution (PD2) clustering algorithm is designed and implemented. Discrete distributions are widely adopted data signatures in information retrieval and machine learning, and discrete distribution (D2) clustering is a fundamental methodology. However, the high computational complexity of D2-clustering limits its impact on massive learning problems. PD2-clustering, with substantially improved scalability, facilitates unsupervised learning in many big data applications. Extensive analysis and experiments are presented to demonstrate the effectiveness and advantages of PD2-clustering. Second, satellite image analysis for storm forecasting is explored as an application of big data in meteorology. A large number of historical satellite images and storm report archives are mined to predict storms. The proposed algorithm extracts visual storm signatures from satellite image sequences in a way similar to how meteorologists interpret them, and incorporates past meteorological records to model and classify the signatures. Such a big-data-driven approach aims to overcome the intrinsic numerical instability of conventional weather forecasting based on physical numerical models, and serves as a new component in a weather forecasting system. Experimental results in both studies show the benefits of leveraging big data in multiple areas.