Empirical Study on Model Compression Techniques for Efficient Deployment of Large Language Models and Scalable Diffusion Models with Transformers

Supervisor: Prof. Luo Ping; Second examiner: Prof. Kong Lingpeng

Student: Li Zhiqian

Large models can be compressed into much smaller models with quantization!

Quantization is a model size reduction technique that converts model weights from high-precision floating-point representation to low-precision floating-point (FP) or integer (INT) representations, such as 8-bit or 4-bit.
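As a concrete illustration, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in PyTorch (an illustrative baseline, not the exact scheme studied in this project):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Map the largest absolute weight onto the INT8 range [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate FP32 weight for inference or error measurement.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)            # an FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```

Storing the weights in INT8 instead of FP32 already cuts weight memory by roughly 4x; the techniques below aim to preserve accuracy at even lower bit-widths such as 4-bit.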

Key Techniques in Model Compression

Apply them to Large Language Models (LLMs) and Scalable Diffusion Models with Transformers (DiTs)

Learnable Weight Clipping

Learns an optimal dynamic clipping threshold for the weight distribution
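A minimal PyTorch sketch in the spirit of OmniQuant-style learnable weight clipping: sigmoid-bounded factors shrink the per-channel weight range before quantization and would be optimized to minimize block-wise output error. Class and parameter names here are illustrative; in practice gradients pass through the rounding via a straight-through estimator:

```python
import torch

class LearnableWeightClipping(torch.nn.Module):
    # Learn per-output-channel clipping strengths for the weight range.
    def __init__(self, out_features: int, n_bits: int = 4):
        super().__init__()
        # Raw parameters; sigmoid keeps the clipping factors in (0, 1).
        # Initialized near 1 (sigmoid(4) ~ 0.98), i.e., almost no clipping.
        self.gamma = torch.nn.Parameter(torch.full((out_features, 1), 4.0))
        self.beta = torch.nn.Parameter(torch.full((out_features, 1), 4.0))
        self.qmax = 2 ** n_bits - 1

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # Shrink the per-channel max/min by the learnable factors, then fake-quantize.
        wmax = torch.sigmoid(self.gamma) * w.amax(dim=1, keepdim=True)
        wmin = torch.sigmoid(self.beta) * w.amin(dim=1, keepdim=True)
        scale = (wmax - wmin).clamp(min=1e-8) / self.qmax
        zero = torch.round(-wmin / scale)
        q = torch.clamp(torch.round(w / scale) + zero, 0, self.qmax)
        return (q - zero) * scale

w = torch.randn(128, 512)                      # (out_features, in_features)
w_q = LearnableWeightClipping(out_features=128)(w)
```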

Learnable Equivalent Transformation

Channel-wise scaling and shifting of the activation distribution to address the outlier issue

   

  • SmoothQuant: Xiao, Guangxuan, et al. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." International Conference on Machine Learning. PMLR, 2023.
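A minimal sketch of the scaling half of such an equivalent transformation, following the SmoothQuant rule s_j = max|X_j|^α / max|W_j|^(1−α). OmniQuant's LET additionally learns a channel-wise shift that is folded into the bias, which is omitted here; all shapes and names are illustrative:

```python
import torch

def smooth_scale(act_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    # Per-input-channel scale that migrates activation outliers into the weights.
    w_absmax = w.abs().amax(dim=0)          # w: (out_features, in_features)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

# Toy calibration data with a few outlier channels (assumed for illustration).
x = torch.randn(32, 512) * torch.linspace(0.1, 20.0, 512)
w = torch.randn(512, 512)                   # nn.Linear-style weight

s = smooth_scale(x.abs().amax(dim=0), w)
x_s, w_s = x / s, w * s                     # X @ W.T == (X/s) @ (s*W).T
print("max difference:", (x @ w.T - x_s @ w_s.T).abs().max().item())
```

Because the matrix product is unchanged, the transformation is mathematically equivalent, but the rescaled activations are flatter and therefore easier to quantize.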

Piecewise Weight Quantization

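One common reading of piecewise weight quantization is to split the weight range into a dense inner region and a sparse outlier region and give each its own quantization scale. The sketch below follows that reading; the split rule (a quantile of |W|) and all names are assumptions for illustration, not necessarily the exact scheme used in this project:

```python
import torch

def piecewise_quantize(w: torch.Tensor, n_bits: int = 4, inner_q: float = 0.99):
    # Two-region (piecewise) fake quantization: the dense bulk of the weights
    # gets a fine scale, the remaining outliers get a coarser second scale.
    qmax = 2 ** (n_bits - 1) - 1
    t = torch.quantile(w.abs().flatten(), inner_q)   # split point between regions
    inner = w.abs() <= t
    s_inner = t / qmax                               # fine scale for the bulk
    s_outer = w.abs().max() / qmax                   # coarse scale for outliers
    return torch.where(
        inner,
        torch.clamp(torch.round(w / s_inner), -qmax, qmax) * s_inner,
        torch.clamp(torch.round(w / s_outer), -qmax, qmax) * s_outer,
    )

w = torch.randn(1024, 1024)
print("piecewise error:", (w - piecewise_quantize(w)).abs().mean().item())
```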

Activation Clipping

• Top-K (e.g., top-10): limits the range of activation values by using the K-th largest (or smallest) value as the clipping bound, so only the most extreme activations are clipped.

• Clipping coefficient: limits the range of activation values by multiplying the full range with a clipping coefficient and clamping to the reduced range (a sketch of both rules follows this list).
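A minimal sketch of both rules, under the reading that top-K uses the K-th largest magnitude as the clipping bound and that the coefficient method shrinks the full range by a factor c (both readings, and all names, are assumptions for illustration):

```python
import torch

def clip_topk(x: torch.Tensor, k: int = 10) -> torch.Tensor:
    # Use the K-th largest |activation| as the bound, so only the K most
    # extreme values are clamped.
    bound = torch.topk(x.abs().flatten(), k).values[-1].item()
    return x.clamp(-bound, bound)

def clip_by_coefficient(x: torch.Tensor, c: float = 0.9) -> torch.Tensor:
    # Shrink the full range |x|_max by the coefficient c and clamp to it.
    bound = c * x.abs().max().item()
    return x.clamp(-bound, bound)

x = torch.randn(16, 4096)
x_top10 = clip_topk(x, k=10)
x_coef = clip_by_coefficient(x, c=0.9)
```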

Demo of the DiT structure

DiT structure, where the highlighted layers are quantized with the uniform method and the other layers with the piecewise method. Original figure from Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Demo of the Transformer block

Learnable Weight Clipping in a Transformer block. Original figure from Shao, Wenqi, et al. "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models." arXiv preprint arXiv:2308.13137 (2023).