Empirical Study on Model Compression Techniques for Efficient Deployment of Large Language Models and Scalable Diffusion Models with Transformers
Supervisor: Prof. Luo Ping, Second examer: Prof. Kong Lingpeng
Student: Li Zhiqian
Large Models can be compressed into smaller-size models with Quantization!
Quantization is a model size reduction technique that converts model weights from high-precision floating-point representation to low-precision floating-point (FP) or integer (INT) representations, such as 8-bit or 4-bit.
![](https://wp2023.cs.hku.hk/fyp23056/wp-content/uploads/sites/57/qu.png)
Key Techniques in Model Compression
Apply them to Large Language Models(LLMs) and Scalable Diffusion Models with Transformers(DiTs)
![](https://wp2023.cs.hku.hk/fyp23056/wp-content/uploads/sites/57/Screen-Shot-2024-05-20-at-12.20.12-PM.png)
Learnable Weight Clipping
Optimal dynamic clipping threshold for weight distribution
Learnable Equivalent Transformation
Channel-wise scaling and shifting for activation distribution -> outlier issue
![](https://wp2023.cs.hku.hk/fyp23056/wp-content/uploads/sites/57/Screen-Shot-2024-05-20-at-12.27.00-PM.png)
- SmoothQuant refer to Xiao, Guangxuan, et al. “Smoothquant: Accurate and efficient post-training quantization for large language models.” International Conference on Machine Learning. PMLR, 2023.
![](https://wp2023.cs.hku.hk/fyp23056/wp-content/uploads/sites/57/piece.png)
Piecewise Weight Quantization
Whether you’re building a website for your business, personal brand, or creative project, Frost is the perfect solution for anyone looking to launch a website quickly and efficiently.
.
Activations Clipping
•Top10 —- The top-K method limits the range of activation values by keeping only the K largest (or smallest) values and discarding the rest.
•Clipping range —- The clip method limits the range of activation values by multiplying them with a clipping coefficient.
![](https://wp2023.cs.hku.hk/fyp23056/wp-content/uploads/sites/57/Screen-Shot-2024-05-20-at-12.33.38-PM.png)