Impact of soft errors on scientific simulations

Open Access
Bangalore Srinivasmurthy, Sowmyalatha
Graduate Program:
Computer Science and Engineering
Master of Science
Document Type:
Master Thesis
Date of Defense:
October 07, 2011
Committee Members:
  • Padma Raghavan, Thesis Advisor
Keywords:
  • sparse matrix
  • iterative linear solvers
  • soft error
Trends in processor technology are driving toward multicore designs, with miniaturization packing many processors into a given chip area. This miniaturization has led to a significant increase in the occurrence of soft errors, in which a single bit flip can affect the output of the computing system. This in turn affects the performance of the application running on the system. In this thesis, we attempt to understand and characterize the impact of soft errors on scientific simulations. We consider the impact of a single soft error on the widely used preconditioned conjugate gradient (PCG) method, an important kernel in such simulations. We first show that a single error in PCG can propagate through the sequence of sparse matrix-vector multiplication (SpMV) operations that form the core computations in PCG. Consequently, we demonstrate that a single soft error in PCG can degrade performance by factors of 200 or more. Next, we consider the Community Earth System Model (CESM), an extensively used coupled climate model that allows simulation of the Earth's climate system. Our experimental results indicate that although soft errors cause variations in the output of the models, these variations are within the allowable range of perturbations. However, the models are not robust to soft errors in pointer data structures and fail when such errors occur. These results indicate the need for further study of the impact of soft errors on scientific simulations and the need to develop methods for detection and mitigation.
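The propagation of a single bit flip through repeated SpMV operations, as described in the abstract, can be illustrated with a minimal sketch. This is not the thesis's experimental setup: the 1D Laplacian test matrix, the CSR layout, and the helper names `flip_bit`, `laplacian_csr`, `spmv`, and `corrupted_count` are all assumptions chosen for illustration, and the sketch only counts how many vector entries differ from a fault-free run.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def laplacian_csr(n):
    """Hypothetical test matrix: 1D Laplacian (2 on the diagonal, -1 beside it) in CSR form."""
    vals, cols, row_ptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                vals.append(v)
                cols.append(j)
        row_ptr.append(len(vals))
    return vals, cols, row_ptr


def spmv(A, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    vals, cols, row_ptr = A
    return [
        sum(vals[k] * x[cols[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(x))
    ]


def corrupted_count(n, steps, index, bit):
    """Count entries that differ from the fault-free vector after each SpMV."""
    A = laplacian_csr(n)
    clean = [1.0] * n
    faulty = list(clean)
    faulty[index] = flip_bit(faulty[index], bit)  # inject a single soft error
    counts = []
    for _ in range(steps):
        clean = spmv(A, clean)
        faulty = spmv(A, faulty)
        counts.append(sum(1 for a, b in zip(clean, faulty) if a != b))
    return counts


if __name__ == "__main__":
    print(corrupted_count(n=21, steps=3, index=10, bit=52))
```

For this tridiagonal matrix, one corrupted entry spreads to its neighbors at every multiplication, so the number of affected entries grows to 3, then 5, then 7. In a matrix with denser rows, the spread would be correspondingly faster.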