Design and Exploration of Accelerator-rich Multi-core Systems

Open Access
Author:
Chandramoorthy, Nandhini
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
October 05, 2016
Committee Members:
  • Vijaykrishnan Narayanan, Dissertation Advisor
  • John Morgan Sampson, Committee Chair
  • Chitaranjan Das, Committee Member
  • Mary Beth Rosson, Committee Member
  • Kevin Irick, Outside Member
  • Luca Benini, Special Member
Keywords:
  • heterogeneous architectures
  • processor modeling
  • accelerator-rich architectures
Abstract:
Limited power budgets and the need for high-performance computing have led to platform customization, with a number of accelerators integrated alongside many-core CPUs. The design space of architectures combining CPUs and accelerators forms a continuum with varying degrees of specialization in execution data paths, memories, data communication, and control. To study customized architectures, this dissertation focuses on computer vision workloads and on architectures optimized to implement them. Exploiting similarities across computer vision workloads, this dissertation presents configurable functional units, or accelerators, that implement frequently performed operations in these workloads and are optimized for energy efficiency. A detailed study of how to map workloads onto accelerators is presented. This dissertation models customization design points and compares their performance and energy across a number of computer vision workloads. The limitations of generic architectures are analyzed, and the costs and benefits of increasing customization across these micro-architectural design points are quantified. The sources of performance and energy efficiency in customized architectures are identified. The impact of (a) specialized functional units, (b) local memories optimized for specific memory access patterns, and (c) optimization of data transfer from external memories into local memories, and from local memories into functional unit registers, is studied in detail. This analysis leads to a framework consisting of low-power multi-cores and an array of configurable micro-accelerator functional units that performs best for the chosen domain of computer vision workloads. Using this platform, this dissertation illustrates data flow and control processing optimizations that provide performance gains comparable to those of custom ASICs for vision benchmarks.
Scaling such systems to ever larger numbers of cores, and integrating them with a large number of optimized accelerators, substantially increases both computation and data transfers. A comprehensive design-space exploration of multi-core architectures and accelerators with shared memories using cycle-accurate full-chip simulation, or design and synthesis, is becoming impractical due to prohibitive simulation times. To study the impact of shared memories and of data flow between CPU local memories and accelerator local memories, fast, accurate, and scalable models are needed to aid system-architecture design-space exploration. Therefore, this dissertation presents an analytical model that abstracts the processing unit and characterizes it in terms of its memory interaction alone. The presented model, Performance Estimation through Contention ANalysis (PECAN), is a tool that models various uncore components and estimates per-core performance in multi-core architectures with accelerators. Using this framework, the dissertation carries out a fast uncore design-space exploration of multi-core processors with accelerators.