Addressing Reliability Issues in Performance-Critical Processor Structures

Open Access
Author:
Soundararajan, Niranjan Kumar
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
June 07, 2010
Committee Members:
  • Vijaykrishnan Narayanan, Dissertation Advisor
  • Anand Sivasubramaniam, Committee Chair
  • Vijaykrishnan Narayanan, Committee Chair
  • Mary Jane Irwin, Committee Member
  • Yuan Xie, Committee Member
  • Suman Datta, Committee Member
Keywords:
  • Microprocessor
  • Reliability
  • Soft Errors
  • Process Variations
  • Wearout
  • Fault Tolerance
Abstract:
Diminishing transistor sizes, combined with power and performance constraints, have decreased the inherent robustness of devices. With each new generation, designers find it difficult to provide deterministic operating characteristics or guarantee a certain lifetime. This has led to the need for architectural designs that are not only performance-efficient but also provide reliable operation. This dissertation is a step towards developing architectural designs that address reliability issues. As a case in point, we look at two performance-critical processor structures: the issue queue and the reorder buffer. Given that devices can fail at any point in their lifetime, our solutions address failures occurring at different periods of that lifetime. These include manufacturing process variations that affect the operating speed of the device, soft errors that flip stored data values, and wearout failures that severely limit the useful operating period of a device. Issue queues in high-performance microprocessors involve multiple activities that move instructions in and out of the queue, all occurring within a single cycle. Process variation in the issue queue limits the synchronization among these activities, in turn limiting the queue's operating frequency. Our solution allows the faster and slower cells in the issue queue to co-exist, limiting stall cycles even at higher frequencies and thereby allowing performance scaling to continue. Soft errors occur when random particle strikes cause a transistor's state to flip. Complete protection against soft errors is overkill for certain market segments, as this protection comes at the cost of performance. We present microarchitectural solutions that provide guarantees on bounding the soft error vulnerability of a platform to the limits required by the market. Our solution enables flexible spanning of the performance-reliability space depending on requirements.
Further, we explore the impact of soft error vulnerability in multicores. We perform a detailed analysis to identify the factors that determine overall vulnerability as the number of cores is scaled, and we use this analysis to provide a runtime solution that minimizes soft error vulnerability. In addition, prior work has ignored soft error vulnerability when scaling the voltage and frequency of a system. Our analysis clearly shows the need for designers to be aware of the reliability impact when adopting different voltage and frequency scaling algorithms. Finally, we address wearout issues in the issue queue, which lead to entries being shut down early in a microprocessor’s lifetime, in turn affecting overall performance. Our solutions limit the variation in wearout across issue queue entries, significantly decreasing the degradation and allowing the entries to age more uniformly. This in turn helps them meet lifetime requirements. Taken together, this dissertation provides a comprehensive framework for minimizing the impact of failures at different points in a device’s lifetime. This work is a step towards developing fault-tolerant architectural designs that are performance- and cost-competitive across all market segments.