CO-LOCATING COMPUTE AND MEMORY ACCESS IN DEEP LEARNING RECOMMENDATION MODEL INFERENCE

- Author: Kalagi, Vishwas
- Graduate Program: Computer Science and Engineering
- Degree: Master of Science
- Document Type: Master Thesis
- Date of Defense: March 29, 2023
- Committee Members:
  - Chitaranjan Das, Program Head/Chair
  - Chitaranjan Das, Thesis Advisor/Co-Advisor
  - Mahmut Taylan Kandemir, Committee Member
  - Kamesh Madduri, Committee Member
- Keywords: Deep Learning Recommendation Models, hyperthreading, colocation, CPU, Inference, compute intensive, memory intensive, memory bottleneck
- Abstract:
- Neural personalized recommendation models are widely deployed in data center applications, but they are computationally demanding: they comprise two major stages, embedding layers and multi-layer perceptron (MLP) layers, which are memory-intensive and compute-intensive, respectively. A detailed characterization of recent recommendation models and datasets on central processing units (CPUs) shows that the memory-bound embedding stage remains the primary obstacle to reaching peak CPU performance. To address this, a new approach is proposed that hides the memory bottleneck of the embedding stage by executing it concurrently with the compute-intensive bottom MLP stage of the model. Because the bottom MLP and embedding stages are mutually independent and stress different resources (compute versus memory bandwidth), they are ideal candidates for co-location on sibling hardware threads. This study therefore exploits hyperthreading to run the embedding stage and the bottom MLP stage concurrently on the same physical core. Experiments show that this method improves inference performance significantly, yielding speedups of around 35% in single-core scenarios and 27% in multi-core scenarios over the conventional sequential execution of inference workloads. Unlike approaches that require significant hardware changes, the proposed software-only technique is more flexible and can adapt to rapidly evolving model designs.
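
To make the co-location idea concrete, the sketch below pins two Python threads to sibling hyperthreads of one physical core and runs an embedding gather alongside a bottom-MLP matrix multiply. This is a minimal illustration under stated assumptions, not the thesis's implementation: the CPU IDs (0 and 8), table and batch shapes, and the NumPy stand-in kernels are all illustrative, and `os.sched_setaffinity` is Linux-specific. A production implementation would use native threads, since Python's GIL limits true overlap except inside GIL-releasing kernels such as NumPy's matmul.

```python
import os
import threading
import numpy as np

# Assumed sibling hyperthreads of one physical core (e.g., CPUs 0 and 8 on an
# 8-core/16-thread machine); verify with
# /sys/devices/system/cpu/cpu0/topology/thread_siblings_list.
EMB_CPU, MLP_CPU = 0, 8

# Illustrative model state: a large embedding table and a bottom-MLP weight.
EMB_TABLE = np.random.rand(1_000_000, 64).astype(np.float32)
DENSE_IN = np.random.rand(2048, 256).astype(np.float32)
W1 = np.random.rand(256, 512).astype(np.float32)

results = {}

def embedding_stage(indices):
    # Memory-intensive stage: irregular gathers over a table far larger than cache.
    os.sched_setaffinity(0, {EMB_CPU})  # pid 0 = the calling thread on Linux
    results["emb"] = EMB_TABLE[indices].sum(axis=0)

def bottom_mlp_stage():
    # Compute-intensive stage: dense matrix multiply plus ReLU.
    os.sched_setaffinity(0, {MLP_CPU})
    results["mlp"] = np.maximum(DENSE_IN @ W1, 0.0)

idx = np.random.randint(0, EMB_TABLE.shape[0], size=4096)
t1 = threading.Thread(target=embedding_stage, args=(idx,))
t2 = threading.Thread(target=bottom_mlp_stage)
t1.start(); t2.start()
t1.join(); t2.join()
# The downstream feature-interaction / top-MLP stage would consume both
# results["emb"] and results["mlp"] here.
```

The key design point the sketch captures is that the two stages share one physical core but occupy its two logical threads, so the MLP's arithmetic can proceed while the embedding gathers stall on memory, rather than the two stages running back-to-back.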