Parallelism-aware Resource Management Techniques for Many-core Architectures

Open Access
Kayiran, Onur
Graduate Program: Computer Science and Engineering
Degree: Doctor of Philosophy
Document Type: Dissertation
Date of Defense: June 12, 2015
Committee Members:
  • Chitaranjan Das, Dissertation Advisor
  • Mahmut Taylan Kandemir, Dissertation Advisor
  • Chitaranjan Das, Committee Chair
  • Yuan Xie, Committee Member
  • Kenneth Jenkins, Committee Member
  • Onur Mutlu, Special Member
Keywords: GPU, architecture, many-core, throughput, warp, wavefront, scheduling, TLP, thread, parallelism, CTA, heterogeneous
General-purpose graphics processing units (GPGPUs) excel at accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of high-performance computing applications. To support such highly parallel applications, GPUs are designed to execute lightweight threads on hundreds to thousands of execution units. This execution model provides high latency tolerance: when some threads stall on memory accesses, the scheduler has many opportunities to find other ready threads to execute. This throughput-oriented computing paradigm, combined with the benefits of technology scaling, allows GPUs to adopt a scale-up approach, in which not only core counts but also the peak throughput and capabilities of individual cores increase. Furthermore, technology scaling has allowed GPUs to be more tightly integrated with CPUs, giving rise to heterogeneous architectures in which CPUs and GPUs are placed on the same die or package and share the same set of resources. However, all of these trends entail issues that prevent these highly parallel architectures from reaching their peak performance. The main goal of this dissertation is to propose techniques that alleviate the performance-limiting factors in parallel architectures. It consists of three main components. The first part of this dissertation focuses on the effect of application parallelism on GPU performance. It shows that increasing parallelism is not always beneficial for performance, and proposes low-overhead scheduling strategies that optimize the parallelism exhibited by the application to improve performance. The second part of this dissertation targets the problems caused by high parallelism in the context of heterogeneous architectures. It investigates the effect of GPU parallelism on the performance of both CPU and GPU applications, and proposes scheduling techniques that target performance improvements for both classes of applications.
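The first part's core observation, that more TLP can hurt performance once memory contention dominates, lends itself to a simple feedback loop. The sketch below is an illustrative assumption of how such a low-overhead throttle might look; the function name, monitored ratios, and thresholds are hypothetical and are not the dissertation's exact algorithm.

```python
# Illustrative TLP-throttling heuristic (hypothetical names/thresholds):
# periodically raise or lower the number of concurrently scheduled CTAs
# (thread blocks) on a core based on two sampled statistics.

def adjust_active_ctas(active_ctas, idle_ratio, stall_ratio,
                       max_ctas, t_idle=0.2, t_stall=0.4):
    """Return the CTA count for the next sampling interval.

    idle_ratio  -- fraction of cycles with no ready warp to issue
    stall_ratio -- fraction of cycles stalled on memory accesses
    """
    if idle_ratio > t_idle and active_ctas < max_ctas:
        return active_ctas + 1   # core is starving: expose more parallelism
    if stall_ratio > t_stall and active_ctas > 1:
        return active_ctas - 1   # memory system congested: throttle TLP
    return active_ctas           # parallelism is near its sweet spot
```

The design point this illustrates is that the optimal TLP is application- and phase-dependent, so it must be found dynamically rather than fixed at launch time.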
The third part of this dissertation identifies the power- and performance-related challenges brought by the trend of increasing GPU core capabilities, and proposes queueing-theory-based dynamic power- and clock-gating mechanisms that improve performance while reducing the power consumption of GPUs.
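A queueing-theoretic gating decision can be illustrated with a minimal sketch: model a core's execution units as servers, and power on only as many as the observed demand requires. The function below is a hypothetical example assuming a simple utilization target; it is not the dissertation's specific model.

```python
import math

def active_units_needed(arrival_rate, service_rate, target_util=0.7):
    """Smallest number of powered-on units k such that utilization
    rho = arrival_rate / (k * service_rate) stays at or below
    target_util; the remaining units can be power- or clock-gated.

    arrival_rate  -- observed instruction arrival rate (per cycle)
    service_rate  -- per-unit service rate (per cycle)
    """
    return max(1, math.ceil(arrival_rate / (service_rate * target_util)))
```

For example, with an arrival rate of 7 requests per cycle, a per-unit service rate of 2, and a 70% utilization target, five units suffice and the rest could be gated off to save power.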