A Protean Attack on the Compute-storage Gap in High-performance Computing

Open Access
Author:
Wilson, Ellis H
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
May 29, 2014
Committee Members:
  • Mahmut Taylan Kandemir, Dissertation Advisor
  • Mahmut Taylan Kandemir, Committee Chair
  • Padma Raghavan, Committee Member
  • Wang-Chien Lee, Committee Member
  • Christopher J Duffy, Committee Member
Keywords:
  • Distributed Storage
  • High-Performance Computing
  • Hadoop
  • Filesystems
  • Compression
  • Deduplication
  • NAND Flash
  • Solid State Disks
Abstract:
Distributed computing systems, in particular supercomputers, have facilitated a significant acceleration in scientific progress over the last three-quarters of a century by enabling scientists to ask questions whose answers were previously intractable. Looking at historical data for the top supercomputers in the world over the last 20 years, we note that they have demonstrated a remarkable doubling in performance every 13.5 months, well in excess of Moore's law. Moreover, as these machines grow in computational power, the magnetic hard disk drives (HDDs) they rely upon to store and retrieve data double in capacity roughly every 18 months. Considered in concert, these facts provide a foundation for the recent data-driven revolution in the way both scientists and businesses extract useful knowledge from their ever-growing datasets. However, while the computational power and storage capacity of these machines are growing at a breathless rate, a disturbing but oft-ignored reality is that the ability to access the data on a given HDD is shrinking by comparison, doubling only once every decade. In short, this means that although the capability to process and store the data scientists and businesses are so excited about is here, the ability to access that data (a prerequisite for processing it) falls further behind year after year. Therefore, the focus of this thesis is to find ways to limit or close the annually widening compute-to-bandwidth gap, specifically for systems at scale such as supercomputers and the cloud. Recognizing that this problem requires improvement at numerous levels of the storage stack, we take a protean approach to seeking and implementing solutions. Specifically, we attack this problem by researching ways to 1) consolidate our storage devices to maximize aggregate bandwidth while enabling best-of-breed analytic approaches, 2) determine optimal data-reduction techniques such as deduplication and compression in the face of a sea of data and a lack of existing analysis tools, and 3) design novel algorithms to overcome longevity shortcomings in state-of-the-art alternatives to magnetic storage such as flash-based solid-state disks (SSDs).
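
To make the widening gap concrete, the short Python sketch below (not part of the dissertation; a back-of-the-envelope extrapolation assuming simple exponential growth) applies the doubling periods quoted in the abstract over a ten-year horizon:

# Illustrative sketch only: extrapolates the doubling periods quoted in the
# abstract (13.5 months for supercomputer performance, 18 months for HDD
# capacity, ~10 years for per-HDD bandwidth). All code structure here is an
# assumption for illustration, not material from the dissertation.

def growth_factor(years: float, doubling_period_months: float) -> float:
    """Multiplicative growth over `years` given a doubling period in months."""
    return 2 ** (years * 12 / doubling_period_months)

if __name__ == "__main__":
    horizon_years = 10

    compute = growth_factor(horizon_years, 13.5)    # supercomputer performance
    capacity = growth_factor(horizon_years, 18.0)   # HDD capacity
    bandwidth = growth_factor(horizon_years, 120.0) # per-HDD bandwidth

    print(f"Over {horizon_years} years:")
    print(f"  compute grows   ~{compute:6.0f}x")
    print(f"  capacity grows  ~{capacity:6.0f}x")
    print(f"  bandwidth grows ~{bandwidth:6.0f}x")
    print(f"  compute-to-bandwidth gap widens ~{compute / bandwidth:.0f}x")

Under these rates, compute grows roughly 470x and per-HDD bandwidth only 2x over a decade, so the compute-to-bandwidth ratio widens by more than two orders of magnitude, which is the gap the three research thrusts above target.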