Shared Storage Resource Management to Provide Predictable Performance

Open Access
- Author:
- Prabhakar, Ramya Arkalgud
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- March 20, 2012
- Committee Members:
- Dr. Mahmut Kandemir, Dissertation Advisor/Co-Advisor
- Dr. Padma Raghavan, Committee Member
- Dr. Mary Jane Irwin, Committee Member
- Dr. Sudharshan Vazhkudai, Special Member
- Dr. Jia Li, Committee Member
- Keywords:
- Storage systems
- I/O
- High Performance Computing
- QoS
- Abstract:
- Emerging high-end computing platforms at petascale provide new horizons for complex modeling and large-scale simulations. While these systems offer unprecedented levels of peak computational power and data storage capacity, a critical challenge concerns the design and implementation of scalable application and system software that makes it possible to harness this power for large-scale, I/O-intensive applications. The increasing complexity of large-scale applications and the continuous growth of their data set sizes, combined with slow improvements in disk access latencies, have made I/O a performance bottleneck.

State-of-the-art storage systems in high-performance computing platforms typically consist of a set of I/O nodes and file/storage servers that are accessed by multiple compute nodes. Applications executing on compute nodes issue streams of I/O requests to the underlying I/O nodes, which in turn request data from multiple file/storage servers. In general, the storage architectures in high-end computing platforms can be multi-tiered, with different types of shared resources throughout the storage stack. Consolidated resources in these systems usually include the layers of the I/O subsystem used to cache recently and/or frequently used data, which reduce the number of I/O requests that reach the disks, as well as the shared I/O bandwidth and disk space that collectively service I/O requests from different applications. To reduce the gap between computation and I/O speeds, effective utilization of system resources throughout the storage hierarchy is essential.

Storage resources are also dynamically shared among concurrently executing applications. Dynamically managing these distributed storage resources across multiple applications raises several questions: (i) how should resources be allocated across competing applications; (ii) what fraction of each application's allocation should come from each of the available nodes in the cluster; and (iii) how should these allocations adapt to dynamic changes in resource requirements at runtime? When concurrently executing applications contend for shared system resources (e.g., the shared storage cache or I/O bandwidth), interference arises among them, which can lead to unpredictable system behavior. Providing predictable performance, or Quality of Service (QoS) guarantees, to applications on such systems, with their complex interactions between different kinds of resources, is a challenging problem. In this context, my research focuses on techniques for effective management of dynamically shared storage system resources to provide predictable performance to concurrently executing applications.
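The abstract describes the feedback-control approach only at a high level. As a purely illustrative aid, the sketch below shows one way a controller might periodically rebalance per-application shares of a shared storage cache toward per-application latency targets, in the spirit of questions (i) and (iii). Everything here (the App record, rebalance_shares, the gain constant, the example workloads) is a hypothetical stand-in, not the dissertation's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class App:
    """Hypothetical per-application QoS state (illustrative only)."""
    name: str
    target_latency_ms: float    # latency the application should see (its QoS goal)
    observed_latency_ms: float  # latency measured over the last control interval
    cache_share: float          # fraction of the shared storage cache it holds

def rebalance_shares(apps, gain=0.05, min_share=0.01):
    """One step of a simple proportional feedback controller.

    Applications missing their latency target grow their cache share in
    proportion to the relative error; applications beating their target
    shrink. Shares are then renormalized so they still sum to 1.0.
    """
    for app in apps:
        error = (app.observed_latency_ms - app.target_latency_ms) / app.target_latency_ms
        app.cache_share = max(min_share, app.cache_share * (1.0 + gain * error))
    total = sum(app.cache_share for app in apps)
    for app in apps:
        app.cache_share /= total  # keep the shared cache fully (and only) allocated

# Invented example: one app over its target, one under it.
apps = [
    App("checkpoint-writer", target_latency_ms=5.0, observed_latency_ms=9.0, cache_share=0.5),
    App("analysis-reader",   target_latency_ms=8.0, observed_latency_ms=6.0, cache_share=0.5),
]
rebalance_shares(apps)
for app in apps:
    print(f"{app.name}: share={app.cache_share:.3f}")
```

Run once per control interval, such a loop shifts cache capacity toward the application violating its QoS goal while keeping the total allocation constant; a real controller would also have to account for the interplay with I/O bandwidth that the dissertation emphasizes.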
The main contributions can be summarized as follows:
- Techniques for QoS decomposition and fulfillment that efficiently manage multi-server storage caches using an adaptive max-flow algorithm and feedback control theory.
- A combination of a per-application latency model and a linear programming model for managing storage caches at both clients and servers in a multi-level, multi-server storage architecture (a sketch of such an allocation LP appears below).
- A multi-resource management strategy that controls inter-application interference by dynamically adjusting the aggressiveness of the shared storage cache and I/O bandwidth accesses made by multiple concurrently executing applications, performing a coordinated allocation of both resources while keeping the interplay between the shared resources intact.
- A heterogeneous staging storage architecture for the HPC I/O hierarchy that seamlessly aggregates both DRAM buffers and SSDs in a tiered architecture.
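The second contribution mentions a linear programming model without spelling it out. The following is a minimal, assumed sketch of what a cache-allocation LP might look like: maximize an estimated aggregate hit count subject to per-server capacity limits and per-application QoS floors. The marginal-hit-rate numbers, capacities, and variable names are invented for illustration, and the sketch uses SciPy's generic linprog solver rather than the dissertation's actual formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: 2 applications, 3 cache servers (all values invented).
n_apps, n_servers = 2, 3
# Estimated hits gained per cache block granted to each application; in a real
# system this would come from a per-application latency/reuse model.
marginal_hits = np.array([3.0, 1.5])
capacity = np.array([100.0, 80.0, 60.0])  # cache blocks available on each server
min_blocks = np.array([40.0, 30.0])       # per-application QoS floor (total blocks)

# Decision variable x[a, s]: blocks of server s's cache allocated to app a,
# flattened row-major into a vector of length n_apps * n_servers.
c = -np.repeat(marginal_hits, n_servers)  # negate: linprog minimizes

rows, rhs = [], []
# Per-server capacity: sum over apps of x[a, s] <= capacity[s]
for s in range(n_servers):
    row = np.zeros(n_apps * n_servers)
    row[s::n_servers] = 1.0
    rows.append(row)
    rhs.append(capacity[s])
# Per-application floor: -(sum over servers of x[a, s]) <= -min_blocks[a]
for a in range(n_apps):
    row = np.zeros(n_apps * n_servers)
    row[a * n_servers:(a + 1) * n_servers] = -1.0
    rows.append(row)
    rhs.append(-min_blocks[a])

res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(rhs),
              bounds=[(0, None)] * (n_apps * n_servers))
alloc = res.x.reshape(n_apps, n_servers)
print(alloc)  # per-(application, server) cache-block allocation
```

In this toy instance the solver gives the high-value application everything beyond the other application's floor, which also illustrates why the dissertation pairs such allocation models with QoS floors and feedback: a pure throughput objective would otherwise starve low-intensity applications.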