A Fresh Look At Data Locality On Emerging Multicores And Manycores

Open Access
Author:
Ding, Wei
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 18, 2014
Committee Members:
  • Mahmut Taylan Kandemir, Dissertation Advisor
  • Mary Jane Irwin, Committee Member
  • Padma Raghavan, Committee Member
  • Dinghao Wu, Committee Member
Keywords:
  • Data Locality
  • Multicore
  • Manycore
  • Compiler
  • Loop
Abstract:
The emergence of multicore platforms offers several opportunities for boosting application performance. These opportunities, which include parallelism and data locality benefits, require strong support from compilers as well as operating systems. However, architectural abstractions relevant to the memory system are scarce in current programming and compiler systems. In fact, most compilers do not take any memory-system-specific parameter into account even when they are performing data locality optimizations. Instead, their locality optimizations are driven by rules of thumb such as “maximizing stride-1 accesses in innermost loop positions”. Only a few compilers take cache- and memory-specific parameters into account to look at the data locality problem in a global sense.

One of these parameters is the on-chip cache hierarchy, which determines how cores are connected and thus how data is shared between computations on different cores. Another parameter is the memory controller. In a network-on-chip (NoC) based multicore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending a considerable amount of time within the chip (in addition to the memory access latency). It also contends with on-chip (cache) accesses, as both use the same NoC resources. The third parameter discussed in this thesis is the row-buffer. Many emerging multicores employ banked memory systems, and each bank has an attached row-buffer that holds the most recently accessed memory row (page). A last-level cache miss that also misses in the row-buffer can experience much higher latency than a cache miss that hits in the row-buffer. Consequently, optimizing for row-buffer locality can be as important as optimizing for cache locality. Motivated by this, in this thesis we propose four different compiler-directed “locality” optimization schemes that take these parameters into account.
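The row-buffer effect described above can be illustrated with a minimal toy model (our own sketch, not the thesis's simulator): a bank keeps one row open, and any access to a different row pays a miss. The row size and access patterns below are assumed values for illustration only.

```python
ROW_BYTES = 2048   # assumed DRAM row (page) size
ELEM = 8           # assumed bytes per element

def row_buffer_misses(addresses):
    """Count misses for a single bank with one open row."""
    open_row, misses = None, 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row != open_row:        # row conflict: close old row, open new one
            misses += 1
            open_row = row
    return misses

n = 1024
sequential = [i * ELEM for i in range(n)]        # walks each row to completion
row_strided = [i * ROW_BYTES for i in range(n)]  # touches a new row every time

assert row_buffer_misses(sequential) == n * ELEM // ROW_BYTES   # only 4 misses
assert row_buffer_misses(row_strided) == n                      # 1024 misses
```

The two access streams touch the same number of elements, yet the row-strided one opens a new row on every access, which is the latency gap that row-buffer locality optimization tries to close.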
Specifically, our first scheme is a cache hierarchy-aware loop transformation strategy for multicore architectures. It determines a loop iteration-to-core mapping by taking into account the application's data access pattern and the multicore's on-chip cache hierarchy. It employs “core vectors” to exploit data reuses at different layers of the cache hierarchy based on their reuse distances, with the goal of maximizing data locality at each level while minimizing data dependences across the cores. In the case of a dependence-free loop nest, we complement this with a customized loop scheduling strategy, which determines a schedule for the iterations assigned to each core with the goal of reducing data reuse distances across the cores. Our experimental evaluation shows that the proposed loop transformation scheme significantly reduces miss rates at all levels of the cache hierarchy as well as application execution time, and when supported by scheduling, the reductions in cache miss rates and execution time become much larger.

The second scheme explores automatic data layout transformation targeting multithreaded applications running on multicores (and is also cache hierarchy-aware). Our transformation considers both the data access patterns exhibited by different threads of a multithreaded application and the on-chip cache topology of the target multicore architecture. It automatically determines a customized memory layout for each target array to minimize potential cache conflicts across threads. Our experiments show that our optimization brings significant benefits over state-of-the-art data locality optimization strategies when tested using 22 benchmark programs on an Intel multicore machine. The results also indicate that this strategy scales to larger core counts and performs better with increased data set sizes.
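The interaction between memory layout and access order that both schemes exploit can be sketched with a toy single-line cache model (our own illustration with assumed sizes, not the thesis's algorithm): with a row-major layout, a stride-1 traversal reuses every fetched cache line, while a column-wise traversal fetches a new line on every access.

```python
LINE = 64   # assumed cache-line size in bytes
ELEM = 8    # assumed bytes per array element
N = 64      # assumed N x N array

def lines_fetched(order):
    """Count line fetches for a one-line 'cache' (worst-case reuse model)."""
    current, fetches = None, 0
    for i, j in order:
        line = ((i * N + j) * ELEM) // LINE   # row-major address -> line
        if line != current:
            fetches += 1
            current = line
    return fetches

row_major = [(i, j) for i in range(N) for j in range(N)]   # stride-1 accesses
col_major = [(i, j) for j in range(N) for i in range(N)]   # stride-N accesses

assert lines_fetched(row_major) == N * N * ELEM // LINE    # 512 fetches
assert lines_fetched(col_major) == N * N                   # 4096 fetches
```

A layout transformation can achieve the same effect from the other direction: instead of reordering the loop, it reorders the array elements in memory so that the existing traversal order becomes stride-1.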
In the third scheme, focusing on multithreaded applications, we propose a compiler-guided off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the controller that handles the access request. We present an extensive experimental evaluation of our compiler-guided optimization strategy using a set of 12 multithreaded application programs under both private and shared last-level caches. The results collected emphasize the importance of optimizing off-chip data accesses.

The fourth scheme is a compiler-directed row-buffer locality optimization strategy. This strategy modifies the memory layout of data to increase the number of row-buffer hits without increasing the number of misses in the on-chip cache hierarchy. We implemented our proposed optimization strategy in an open-source compiler and tested its effectiveness in improving row-buffer performance using a set of multithreaded applications.
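The hop-minimization idea behind the third scheme can be sketched as follows. This is a minimal illustration under assumed parameters (a 4x4 mesh with memory controllers at the four corners and Manhattan-distance routing), not the thesis's placement algorithm.

```python
# Hypothetical 4x4 mesh NoC; controller positions are assumed for illustration.
MESH = 4
CONTROLLERS = [(0, 0), (0, MESH - 1), (MESH - 1, 0), (MESH - 1, MESH - 1)]

def hops(core, ctrl):
    """Link traversals between a core tile and a controller (X-Y routing)."""
    return abs(core[0] - ctrl[0]) + abs(core[1] - ctrl[1])

def nearest_controller(core):
    """Map a thread's data to the controller the fewest links away."""
    return min(CONTROLLERS, key=lambda c: hops(core, c))

# A thread on tile (1, 1) should have its pages served by the (0, 0)
# controller (2 hops) rather than the far corner (4 hops).
assert nearest_controller((1, 1)) == (0, 0)
assert nearest_controller((2, 3)) == (3, 3)
```

In the actual scheme this choice is made at compile time, by laying data out so that the physical pages a thread touches are handled by a nearby controller; the sketch only shows the distance metric being minimized.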