Towards Cost-Effective Resource Management Strategies for Distributed Deep Learning and Data Parallel Workloads on the Cloud

Open Access
- Author:
- Sharma, Aakash
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- August 30, 2024
- Committee Members:
- Chitaranjan Das, Program Head/Chair
Mahmut Kandemir, Major Field Member
Anton Nekrutenko, Outside Unit & Field Member
Chitaranjan Das, Chair of Committee
George Kesidis, Dissertation Advisor
- Keywords:
- Cloud Computing
Distributed Systems
Resource Management
Deep Learning
Data-parallel Systems
- Abstract:
- Cloud computing has become a ubiquitous part of enterprise IT infrastructure owing to its flexibility and cost savings compared to private data centers. For infrequent use, tenants can leverage the cloud to avoid the high acquisition cost of maintaining a private data center. Tenants use the cloud for its various hardware offerings spanning compute, storage, and networking. They run a wide variety of workloads, including GPU-accelerated workloads such as deep learning and traditional data-parallel workloads such as MapReduce. These workloads are typically run in a distributed setup due to the large amount of processing required and the need to complete execution in a timely manner. The objective of this dissertation is to lower the cost of running such large distributed workloads by employing cost-effective resource management strategies in the cloud. To this end, the dissertation comprises three intertwined tasks. The first task proposes techniques to reduce the cost of running CPU- or I/O-bound data-parallel workloads on burstable hardware, while the second task introduces methods and insights for running GPU-accelerated workloads such as distributed deep learning (DDL) cost-effectively. As part of the first two tenant-centric tasks, we evaluate various cloud hardware offerings, such as burstable CPUs/disks, GPUs, interconnects, and the network, through a systematic methodology with the objective of finding specific characteristics that can be exploited to lower cost. Hardware features such as burstability and communication latency are of particular interest. Further, we identify workload characteristics that can be exploited to select the most suitable cloud hardware for running those workloads cost-effectively. Finally, as part of the third task, we propose a novel GPU scheduler for DDL that reduces communication latency and queueing delays across multiple tenants.
The scheduler achieves this through optimal consolidation of the GPUs allocated to the various DDL jobs in the cloud. Together, the three tasks make cloud computing more cost-effective and resource-efficient.