Deep Learning on Mobile Devices with Neural Processing Units

- Author:
- Tan, Tianxiang
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- June 07, 2022
- Committee Members:
- Bhuvan Urgaonkar, Major Field Member
Suhang Wang, Outside Unit & Field Member
Mahanth Gowda, Major Field Member
Guohong Cao, Chair & Dissertation Advisor
Chitaranjan Das, Program Head/Chair
- Keywords:
- Mobile computing
Edge computing
Deep Learning
- Abstract:
- Deep Neural Networks (DNNs) have been successfully applied to a variety of computer vision and natural language processing problems. Although DNNs provide good results, they suffer from high computational overhead, which means long delays and high energy consumption when they run on mobile devices. To address this problem, many companies have developed dedicated Neural Processing Units (NPUs) to accelerate deep learning on mobile devices. An NPU can significantly reduce the running time of DNNs with much less energy; however, it incurs accuracy loss, which poses new research challenges. The goal of this dissertation is to address these challenges by developing techniques to improve the performance and energy efficiency of running DNNs on mobile devices with NPUs.

First, we propose techniques to partition the DNN architecture into layers running on the CPU and layers running on the NPU, to maximize accuracy or minimize processing time depending on the application requirements. Based on the delay and accuracy requirements of the application, we study two problems: Max-Accuracy, where the goal is to maximize the accuracy under a time constraint, and Min-Time, where the goal is to minimize the processing time while keeping the accuracy above a given threshold. To solve these problems, we propose heuristic algorithms that are simple but search only a small number of layer combinations (i.e., where to run which DNN layers). To further improve performance, we develop a machine learning based model partition algorithm that searches more layer combinations and considers accuracy loss and processing time simultaneously.

Second, we propose techniques to improve the performance of running DNNs on mobile devices while avoiding overheating. Compared to the CPU, the mobile GPU can be leveraged to improve performance. However, after running DNNs for a short period of time, the mobile device may overheat and the processors are forced to reduce their clock speed, which significantly reduces the processing speed. Compared to the GPU, the NPU is much faster and more energy efficient, but it has lower accuracy due to its use of low-precision floating-point numbers. We propose to combine the two approaches by studying the thermal-aware scheduling problem, where the goal is to achieve a better tradeoff between processing time and accuracy while ensuring that the mobile device does not overheat. To solve this problem, we first propose a heuristic-based scheduling algorithm that decides when to run DNNs on the GPU and when to run them on the NPU based on the current state of the mobile device (see the illustrative sketch below), and then propose a deep reinforcement learning based scheduling algorithm to further improve performance.

Third, we propose techniques to support deep learning applications through edge processing and the NPU on mobile devices. The major challenge is to determine when to offload the computation and when to use the NPU. Based on the processing time and accuracy requirements of the mobile application, we study three problems: Max-Accuracy, where the goal is to maximize the accuracy under time constraints; Max-Utility, where the goal is to maximize a utility defined as a weighted function of processing time and accuracy; and Min-Energy, where the goal is to minimize the energy consumption under time and accuracy constraints. We formulate them as integer programming problems and propose heuristic-based solutions.
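As a rough illustration of the thermal-aware GPU/NPU scheduling idea above, the following Python sketch prefers the more accurate GPU while thermal headroom remains and falls back to the cooler, faster NPU as the device approaches its throttling temperature. The `ProcessorProfile` fields, the latency/accuracy/heating numbers, and the thresholds are all illustrative assumptions, not the dissertation's measured values or its actual algorithm.

```python
# Hypothetical sketch of a thermal-aware GPU/NPU scheduling heuristic.
# All numbers, thresholds, and names are illustrative assumptions, not
# the dissertation's actual algorithm or measurements.
from dataclasses import dataclass

@dataclass
class ProcessorProfile:
    latency_ms: float      # assumed per-frame inference time
    accuracy: float        # assumed expected top-1 accuracy
    heat_per_frame: float  # assumed temperature rise per frame (deg C)

GPU = ProcessorProfile(latency_ms=45.0, accuracy=0.92, heat_per_frame=0.08)
NPU = ProcessorProfile(latency_ms=12.0, accuracy=0.88, heat_per_frame=0.01)

TEMP_THROTTLE = 42.0   # assumed temperature limit before throttling
COOLING_RATE = 0.05    # assumed passive cooling per frame interval

def schedule_frame(current_temp: float) -> ProcessorProfile:
    """Prefer the more accurate GPU while thermal headroom remains;
    otherwise fall back to the cooler, faster NPU."""
    if current_temp + GPU.heat_per_frame < TEMP_THROTTLE:
        return GPU
    return NPU

def run_stream(num_frames: int, start_temp: float = 35.0) -> float:
    """Simulate scheduling a stream of frames; return mean accuracy."""
    temp, total_acc = start_temp, 0.0
    for _ in range(num_frames):
        proc = schedule_frame(temp)
        total_acc += proc.accuracy
        temp = max(start_temp, temp + proc.heat_per_frame - COOLING_RATE)
    return total_acc / num_frames
```

A real scheduler would read the device's thermal sensors and learned cost models rather than the fixed constants assumed here; the reinforcement learning variant described above would replace the fixed rule with a learned policy.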
Finally, we further improve the performance of offloading by leveraging the confidence scores produced when running DNNs on mobile devices. If the confidence score is above a threshold, the classification result from the NPU is most likely accurate and can be used directly; otherwise, the data is offloaded for further processing to improve the accuracy. However, the confidence scores of many advanced DNNs do not accurately estimate the correctness of the classification results, and thus may not be effective for making offloading decisions. We propose confidence score calibration techniques, formulate the confidence-based offloading problem, where the goal is to maximize accuracy under a time constraint, and propose an adaptive solution that determines which frames to offload, and at what resolution, based on the confidence score and the network condition. Through real implementations and extensive evaluations, we demonstrate that the proposed solutions significantly outperform other approaches.
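To make the confidence-based offloading step concrete, here is a minimal sketch assuming temperature scaling as the calibration method (a standard calibration technique; the dissertation's own calibration may differ) and a fixed confidence threshold. The function names, the 0.8 threshold, and the softmax temperature are illustrative assumptions.

```python
# Hypothetical sketch of confidence-based offloading with temperature
# scaling. The threshold, temperature, and names are assumptions, not
# necessarily the dissertation's calibration technique or parameters.
import numpy as np

def calibrated_confidence(logits: np.ndarray, temperature: float = 2.0) -> float:
    """Temperature-scaled softmax confidence; a temperature > 1 softens
    the overconfident scores typical of modern DNNs."""
    z = logits / temperature
    z = z - z.max()                      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

def should_offload(logits: np.ndarray,
                   threshold: float = 0.8,
                   network_ok: bool = True) -> bool:
    """Offload a frame for server-side processing only when the local
    NPU result looks unreliable and the network can absorb the transfer."""
    return network_ok and calibrated_confidence(logits) < threshold

# Example: a confidently classified frame stays on the device.
logits = np.array([8.0, 1.0, 0.5])
print(should_offload(logits))  # False: calibrated confidence ~0.95 > 0.8
```

The adaptive solution described above would additionally pick the upload resolution per frame from the confidence score and the current network condition, rather than making the binary decision sketched here.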