Optimizing Cloud Efficiency: A Performance and Cost Perspective
Restricted (Penn State Only)
- Author:
- Huang, Lexiang
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- February 21, 2024
- Committee Members:
- Chitaranjan Das, Program Head/Chair
Linhai Song, Outside Unit & Field Member
Timothy Zhu, Chair & Dissertation Advisor
Anand Sivasubramaniam, Major Field Member
Bhuvan Urgaonkar, Major Field Member
- Keywords:
- Cloud Computing
Performance Debugging
Cost Optimization
System Failures
- Abstract:
- Cloud efficiency is becoming increasingly important as the cloud scales. Cloud providers, both public and private, want to deliver highly performant services at low cost. High cloud efficiency is achieved by optimizing cloud applications to take full advantage of the computing power and cost-effectiveness of the cloud. In this dissertation, we propose approaches to optimize the performance and cost of applications running on cloud systems. For debugging performance issues, we describe tprof, a performance profiler that uses structural aggregation and automated analysis of distributed systems traces to detect performance inefficiencies. We also develop an approach to automatically instrument application code, conduct measurements, and generate bug reports. For reducing cost, we design Workload Intelligence, a novel cloud interface for dynamic bi-directional communication between workloads and the cloud platform. We demystify the critical workload characteristics that enable cloud optimizations for improving cloud efficiency. By punching holes through the current abstraction, the cloud can reduce its costs without violating any workload requirements and pass the savings on to its users. Lastly, we illustrate that cloud systems running at a higher load are more vulnerable to getting stuck in a permanent overload. The vulnerability comes from organic optimizations of distributed systems (e.g., retries). We introduce a framework named metastable failures to understand the trade-off between cloud efficiency and vulnerability, and we present an analysis of real-world performance incidents that are direct consequences of this vulnerability.