Reuse distance models for accelerating scientific computing workloads on multicore processors

Open Access
Author:
Park, Jeonghyung
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
May 04, 2015
Committee Members:
  • Padma Raghavan, Dissertation Advisor
  • Padma Raghavan, Committee Chair
  • Mahmut Taylan Kandemir, Committee Member
  • Kamesh Madduri, Committee Member
  • Christopher J Duffy, Committee Member
Keywords:
  • high performance computing
  • scientific computing
  • parallel computing
  • performance optimization
Abstract:
As the number of cores increases in chip multiprocessors (CMPs), or multicores, we often observe performance degradation due to complex memory behavior on such systems. To mitigate such inefficiencies, we develop schemes that can be used to characterize and improve the memory behavior of a multicore node for scientific computing applications that require high performance. We leverage the fact that such scientific computing applications often comprise code blocks that are repeated, leading to certain periodic properties. We conjecture that these periodic properties and their observable impacts on cache performance can be characterized in sufficient detail by simple 'alpha + beta*sine' models. Additionally, starting from such a model of the observable reuse distances, we develop a predictive cache miss model, followed by appropriate extensions for predictive capability in the presence of interference. We consider the utilization of our reuse distance and cache miss models for accelerating scientific workloads on multicore systems. We use our cache miss model to determine a set of preferred applications to be co-scheduled with a given application to minimize performance degradation from interference. Further, we propose a reuse distance reducing ordering that improves the performance of Laplacian mesh smoothing. We reorder mesh vertices based on the initial quality of each node and its neighboring nodes so that we can improve both temporal and spatial locality. The reordering results show that a 38.75% performance improvement of Laplacian mesh smoothing is obtained by our reuse distance reducing ordering when running on a single core, and a speedup of 75x is obtained when scaling up to 32 cores.
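The abstract does not reproduce the fitting procedure behind the 'alpha + beta*sine' reuse-distance model; purely as a hypothetical illustration (the synthetic trace, the assumed known period, and the projection-based fit below are assumptions, not the dissertation's actual method), one simple way to fit a sinusoid with a known period to a periodic trace is by orthogonal projection:

```python
import math

def fit_sinusoid(trace, period):
    """Fit y_t ~ alpha + b*sin(w*t) + c*cos(w*t) for a known period.

    Uses orthogonal projection onto the constant, sine, and cosine
    basis functions, which is exact (up to floating-point error) when
    len(trace) covers a whole number of periods.
    """
    n = len(trace)
    w = 2 * math.pi / period
    alpha = sum(trace) / n                                     # mean level
    b = 2 * sum(y * math.sin(w * t) for t, y in enumerate(trace)) / n
    c = 2 * sum(y * math.cos(w * t) for t, y in enumerate(trace)) / n
    return alpha, b, c

# Synthetic stand-in for a periodic reuse-distance trace:
# baseline 100 with an amplitude-20 oscillation of period 16,
# sampled over 4 full periods.
trace = [100 + 20 * math.sin(2 * math.pi * t / 16) for t in range(64)]
alpha, b, c = fit_sinusoid(trace, period=16)
```

Here the recovered `alpha` is the mean reuse distance and `b` the oscillation amplitude; a real workload would of course require estimating the period from the trace (e.g. via its dominant frequency) rather than assuming it.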