Resource Constrained Transformer Architectures
Restricted (Penn State Only)
- Author:
- Lee, Chonghan
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- March 14, 2024
- Committee Members:
- Chitaranjan Das, Program Head/Chair
- Vijaykrishnan Narayanan, Chair & Dissertation Advisor
- Guoray Cai, Outside Unit & Field Member
- Ting He, Major Field Member
- John Sampson, Major Field Member
- Keywords:
- AutoML
- Transformer
- HW-SW Codesign
- NLP
- Computer Vision
- NAS
- PIM
- Abstract:
- The field of Artificial Intelligence (AI) has shown remarkable progress on a wide range of tasks, including vision recognition, natural language understanding, video generation, and more. This progress has been driven by Deep Learning, in which large Deep Neural Networks (DNNs) are trained on vast quantities of data using advanced computational hardware. In particular, the Transformer, built on a self-attention mechanism, has been a breakthrough in domains such as Natural Language Processing (NLP) and Computer Vision (CV). The model is designed to process sequential input data, such as natural language or an ordered sequence of small image patches. However, Transformers have quadratic compute and memory complexity with respect to the length of the input sequence, resulting in enormous environmental and computational costs. Furthermore, it is challenging to deploy these powerful models on resource-limited hardware devices because of their rapidly increasing model size and computation. It is therefore crucial to make AI more efficient, with lower engineering and computational costs. Large neural networks are deployed on a broad spectrum of hardware devices, ranging from cloud GPUs/TPUs to mobile devices. Efficient inference of large-scale models like Transformers on diverse hardware platforms requires a repetitive Neural Architecture Search (NAS) process to design compact versions of the models optimized for each platform. However, these compact models are conceived for scenarios in which the inference budget is known a priori, and the repeated engineering process of searching for and training a model under each resource constraint is intensive.

  In our first work, we address this issue with an adaptive Transformer architecture that can be customized for each computational budget without additional training and achieves optimal accuracy-efficiency tradeoffs on various NLP tasks. The adaptive Transformer model is trained once with progressive pruning schemes; once trained, it can dynamically prune tokens and attention heads jointly from the input sequence to significantly reduce computational cost.

  The second work presents an adaptive Vision Transformer architecture for Fine-Grained Visual Classification (FGVC) that dynamically prunes small image patches, eliminating a large number of redundant tokens in the later layers. The progressive pruning scheme results in an efficient yet powerful model that focuses on the objects and successfully captures local discriminative features across images from various categories.

  Processing in Memory (PIM) architectures have been extensively explored as hardware accelerators for Transformers to reduce data movement cost during model inference. However, existing works lack support for algorithmic optimizations such as token-adaptive Transformers. The third work presents a software-hardware co-designed PIM framework that incorporates support for token-adaptive Transformers and maximizes throughput and efficiency.
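
To make the token-pruning idea mentioned in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of one common approach: ranking tokens by the attention they receive from the [CLS] token and keeping only the top fraction for later layers. It illustrates the general idea of dynamic token pruning, not the dissertation's actual progressive pruning scheme; the function name, the `keep_ratio` parameter, and the [CLS]-attention scoring heuristic are all assumptions.

```python
import torch

def prune_tokens(hidden_states, attention_probs, keep_ratio=0.7):
    """Keep the top-k tokens per sequence, ranked by how much attention
    the [CLS] token pays to each of them (a common importance proxy).

    hidden_states:   (batch, seq_len, dim)
    attention_probs: (batch, num_heads, seq_len, seq_len), rows sum to 1

    NOTE: hypothetical sketch; not the method proposed in this dissertation.
    """
    # Importance score: attention from the [CLS] token (position 0),
    # averaged over heads -> (batch, seq_len)
    cls_attention = attention_probs[:, :, 0, :].mean(dim=1)

    seq_len = hidden_states.size(1)
    num_keep = max(1, int(seq_len * keep_ratio))

    # Indices of the most-attended tokens in each sequence
    keep_idx = cls_attention.topk(num_keep, dim=-1).indices  # (batch, num_keep)

    # Gather the surviving tokens; later layers then operate on a
    # shorter sequence, which is where the compute savings come from
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
    return torch.gather(hidden_states, dim=1, index=keep_idx)
```

Because self-attention cost grows quadratically with sequence length, shrinking the sequence in later layers reduces compute roughly with the square of the kept fraction for those layers.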