CO-ADAPTING SCIENTIFIC APPLICATIONS AND ARCHITECTURES TOWARD ENERGY-EFFICIENT HIGH PERFORMANCE COMPUTING

Open Access
Author:
Malkowski, Konrad
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
July 21, 2008
Committee Members:
  • Padma Raghavan, Dissertation Advisor, Committee Chair
  • Mary Jane Irwin, Committee Member
  • Mahmut Taylan Kandemir, Committee Member
  • Michael Perrone III, Committee Member
  • Jia Li, Committee Member
Keywords:
  • scientific applications
  • power
  • adaptivity
  • optimization
  • energy-efficient
  • architecture
  • performance
Abstract:
Many fields in science and engineering rely on computational models for design and discovery. These models are typically derived either from experimental data or from a mathematical description of the physical phenomena. In most cases, the underlying computer representation is sparse, involving matrices with few nonzero entries per row, graphs with just a few edges, or meshes with few neighboring elements per node. Such models are solved computationally at a scale corresponding to the desired level of detail and require large multi-processor systems to ensure reasonable execution times. Owing to micro-architectural trends, the power and cooling costs of such multi-processor systems are high, and the efficiency of sparse applications is low relative to theoretical peak performance. In this thesis, we therefore consider the challenges of enabling energy-efficient sparse scientific computing.

Our goal is to improve the energy efficiency of sparse scientific applications while maintaining or improving performance. We investigate opportunities for energy and performance improvements in order of decreasing scale. First, we consider opportunities for improving energy efficiency at the scale of the whole multi-processor system. We then investigate single-node optimizations, starting with single-core and moving toward multi-core power and performance optimizations.

At the multi-processor system scale, we begin by considering opportunities for improving workload distribution in parallel applications where computational dependencies and parallelism across the compute nodes are described by a tree. We introduce a new multi-pass workload-to-processor distribution technique and a workload distribution quality measure that we call critical overload. Our mapping scheme decreases the worst-case critical overload in some cases from approximately 60% to 27%; on average, it halves the critical overload across our test suite, from nearly 50% for the original mapping to 25%. Because the workload imbalance is not completely eliminated, we then consider converting the remaining imbalance into energy savings using dynamic voltage and frequency scaling (DVFS) and "just-in-time" computation principles. We introduce three schemes that reduce energy consumption by up to 21% on average across our test suite without degrading application performance.

At the scale of a single node of a multi-processor system, we consider power and performance optimizations for single-core and multi-core architectures. We begin by investigating interactions between hardware features and sparse application properties on single-core processors. We show that energy savings of up to 64% are possible, without degrading application performance, when the right combination of hardware, code optimizations, and data structures is used. Next, we propose new hardware features aimed at optimizing the performance and energy consumption of sparse scientific applications on single-core processors: a load miss predictor and a phase-aware adaptive hardware selection controller. Combined with DVFS, our load miss predictor reduces system power by 7.3% and energy by 17.3% while maintaining an 8.7% improvement in execution time. Our phase-aware adaptive hardware selection controller reduces the power consumption of the test applications by as much as 39% and their energy consumption by as much as 37% without degrading application performance.

In the multi-core, single-node design space, we consider the problem of energy-efficient data lookup in shared non-uniform cache access (NUCA) L2 caches. We propose a novel tree-based cache architecture, T-NUCA, as an alternative to static-placement NUCA (S-NUCA) and dynamic-placement, broadcast-based NUCA (D-NUCA) architectures. In the best instance, in the single-program scenario, T-NUCA outperforms D-NUCA by 27% at a 35% overhead in transistor count, while reducing network link use by 48%, energy consumption by 93%, and the energy-delay product (EDP) by 95%. Relative to S-NUCA, in the best instance, T-NUCA reduces execution time by 38%, increases link use by 161% and energy consumption by 32% due to directory overheads, and reduces EDP by 19%.

We conclude this thesis with a discussion of our results, associated open problems, and future research directions.