Open Access
Liu, Chun
Graduate Program:
Computer Science and Engineering
Doctor of Philosophy
Document Type:
Date of Defense:
July 07, 2005
Committee Members:
  • Anand Sivasubramaniam, Committee Chair
  • Mahmut Taylan Kandemir, Committee Chair
  • Mary Jane Irwin, Committee Member
  • Natarajan Gautam, Committee Member
Keywords:
  multi-threaded characteristics performance power c
Chip multiprocessors (CMPs) are becoming a popular way of exploiting the ever-increasing number of on-chip transistors. Multi-threaded applications aim to utilize the raw power that CMPs provide more efficiently than is currently possible. However, current multi-threaded applications exhibit load imbalances at various levels. The increasing capacity of on-chip storage and the growing cost of wire delays make the on-chip location of data critical: it is important to place data in the right location, at the right time, in the on-chip cache hierarchy. In this study, we characterize the load imbalance at barriers, the imbalance among cache requests from different cores, and the demands on different blocks of the cache. Using the insights obtained from this characterization, we then propose techniques that exploit these load imbalances to improve power and performance.

For the load imbalance at barriers, we observe that the imbalances are quite predictable. Using an integrated hardware-software mechanism, we propose a novel technique for optimizing the power consumption of CMPs. Exploiting the high-level synchronization construct called a barrier, our technique tracks the idle time a processor spends waiting for other processors to arrive at the same point in the program. Using this knowledge, processor frequencies can be modulated to reduce or eliminate idle time, providing power savings without compromising performance.

For the load imbalance imposed on the L2 cache by the different cores, we observe that the possible imbalance between the L2 demands across the cores favors a shared L2 organization, while the interference among these demands favors a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, which captures the benefits of both types of organizations while avoiding many of their drawbacks.
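The barrier-based power technique can be illustrated with a minimal sketch. This is not the thesis implementation; the function name, the nominal frequency `F_MAX`, and the per-core compute times are all illustrative assumptions. The idea: if the previous barrier interval's per-core compute times predict the next one, each core's frequency can be lowered just enough that it arrives at the barrier together with the slowest core, converting idle time into dynamic-power savings.

```python
# Hypothetical sketch (not the thesis mechanism): scale each core's
# frequency so all cores reach the next barrier at the same time.

F_MAX = 2.0  # GHz; assumed nominal frequency of every core


def scaled_frequencies(compute_times):
    """Given each core's compute time (measured at F_MAX) for the last
    barrier interval, return a frequency per core such that every core
    finishes when the slowest one does, eliminating idle wait time."""
    critical = max(compute_times)  # the slowest core sets the pace
    # work is fixed (time * frequency), so stretching a core's interval
    # from t to `critical` lets it run at F_MAX * t / critical
    return [F_MAX * t / critical for t in compute_times]


freqs = scaled_frequencies([10.0, 8.0, 5.0, 10.0])
# → [2.0, 1.6, 1.0, 2.0]: cores that previously idled now run slower
```

Since dynamic power grows superlinearly with frequency (and voltage can drop with it), the formerly idle cores save power while the barrier's arrival time, set by the critical core, is unchanged.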
We also study the demands on different blocks of the L2 cache, namely actively shared blocks and mostly privately accessed blocks. We show that, while a considerable number of L2 accesses target shared data, the actual volume of that data is relatively low. Consequently, it is important to keep the shared data fairly close to all processor cores, for both performance and power reasons. Motivated by this observation, we propose a small center cell cache residing in the middle of the processor cores that gives all cores fast access to shared data. We demonstrate that this cache organization considerably lowers the number of block migrations between the L2 portions closer to each core, thus providing better performance. Combined with sequential tag-data access, the power consumption of such a shared cache can be reduced further.
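The lookup order implied by the center cell organization can be sketched as follows. This is a simplified model under stated assumptions, not the thesis design: the class name, the dictionary-based "caches", and the sharer-count promotion rule are all illustrative. The point it shows is that placing blocks with multiple sharers in a small central structure means they stop migrating back and forth between per-core L2 slices.

```python
# Hypothetical sketch: a small central cache for actively shared blocks
# in front of per-core L2 slices. Structure and names are illustrative.

class CenterCellL2:
    def __init__(self, num_cores):
        self.center = {}  # small cache equidistant from all cores
        self.slices = [dict() for _ in range(num_cores)]  # per-core L2 portions
        self.migrations = 0  # slice-to-slice block moves

    def access(self, core, addr, sharers):
        # 1. Actively shared blocks live in the center cell: fast for every core.
        if addr in self.center:
            return "center-hit"
        # 2. Otherwise check the slice closest to the requesting core.
        if addr in self.slices[core]:
            return "local-hit"
        # 3. A hit in a remote slice would normally migrate the block toward
        #    the requester; promoting multi-sharer blocks to the center cell
        #    instead avoids ping-ponging between slices.
        for other, sl in enumerate(self.slices):
            if other != core and addr in sl:
                if len(sharers) > 1:
                    self.center[addr] = sl.pop(addr)  # promote shared block
                else:
                    self.slices[core][addr] = sl.pop(addr)  # migrate private block
                    self.migrations += 1
                return "remote-hit"
        # 4. Miss: fill into the requester's slice.
        self.slices[core][addr] = True
        return "miss"
```

In this toy model a block first touched by core 0 and then requested by core 1 with two sharers is promoted to the center cell, so all subsequent accesses from any core hit there without further migration.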