4-bit Quantization with GPTQ | Towards Data Science


Quantize your own LLMs using AutoGPTQ


Recent advancements in weight quantization allow us to run massive large language models on consumer hardware, like a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like GPTQ, GGML, and NF4.

In the previous article, we introduced naïve 8-bit quantization techniques and the excellent LLM.int8(). In this article, we will explore the popular GPTQ algorithm to understand how it works and implement it using the AutoGPTQ library.

You can find the code on Google Colab and GitHub.

Let’s start by introducing the problem we’re trying to solve. For every layer ℓ in the network, we want to find a quantized version Ŵₗ of the original weights Wₗ. This is called the layer-wise compression problem. More specifically, to minimize performance degradation, we want the outputs (ŴₗXₗ) of these new weights to be as close as possible to the original ones (WₗXₗ). In other words, we want to find:

$$\hat{W}_\ell^{*} = \arg\min_{\hat{W}_\ell} \left\lVert W_\ell X_\ell - \hat{W}_\ell X_\ell \right\rVert_2^2$$
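As a rough illustration of this objective, the sketch below measures the squared difference between the original and quantized layer outputs. The round-to-nearest quantizer, the layer shape, and the random calibration inputs are all assumptions made for the example; they are a naive baseline, not what GPTQ does.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest onto a symmetric uniform grid, one scale per row (assumed baseline)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def layer_error(w, w_hat, x):
    """Layer-wise objective: squared norm of W X - W_hat X."""
    return float(np.sum((w @ x - w_hat @ x) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # hypothetical layer: 8 outputs, 16 inputs
X = rng.normal(size=(16, 128))  # hypothetical calibration inputs: 128 samples

W_hat = quantize_rtn(W)
print("error with original weights: ", layer_error(W, W, X))
print("error with 4-bit RTN weights:", layer_error(W, W_hat, X))
```

The second number is the quantity the layer-wise compression problem tries to make as small as possible while keeping Ŵₗ on the quantization grid.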

Different approaches have been proposed to solve this problem, but we’re interested in the Optimal Brain Quantizer (OBQ) framework here.

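This method is inspired by a pruning technique to carefully remove weights from a fully trained dense neural network (Optimal Brain Surgeon). It uses an approximation technique and provides explicit formulas for the best single weight w_q to remove and the optimal update δ_F to adjust the set of remaining non-quantized weights F to make up for the removal:

$$w_q = \arg\min_{w_q} \frac{\left(\text{quant}(w_q) - w_q\right)^2}{[H_F^{-1}]_{qq}}, \qquad \delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$

where quant(w) is the weight rounding given by the quantization and H_F is the Hessian of the layer-wise objective with respect to the remaining weights F.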
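To make these two formulas more concrete, here is a minimal sketch of a single OBQ step on one row of weights. It assumes the Hessian of the layer-wise objective for a row is 2XXᵀ, and it uses a hypothetical fixed 4-bit grid with step 0.25; the helper names `quant` and `obq_step` are ours, not part of any library.

```python
import numpy as np

def quant(w, scale):
    """Round to a signed 4-bit grid with step `scale` (levels -8..7, assumed)."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def obq_step(w, h_inv, scale):
    """Pick the weight to quantize and the compensating update for all weights."""
    diag = np.diag(h_inv)
    err = quant(w, scale) - w
    q = int(np.argmin(err ** 2 / diag))  # weight with the smallest induced error
    delta = -(w[q] - quant(w[q], scale)) / diag[q] * h_inv[:, q]
    return q, delta

rng = np.random.default_rng(0)
w = rng.normal(size=16)          # hypothetical weight row
X = rng.normal(size=(16, 128))   # hypothetical calibration inputs
scale = 0.25                     # hypothetical grid step

h_inv = np.linalg.inv(2.0 * X @ X.T)  # inverse Hessian, all weights still free
q, delta = obq_step(w, h_inv, scale)

w_plain = w.copy()
w_plain[q] = quant(w[q], scale)  # quantize w_q only, no compensation
w_comp = w + delta               # quantize w_q and adjust the other weights

err_plain = float(np.sum(((w - w_plain) @ X) ** 2))
err_comp = float(np.sum(((w - w_comp) @ X) ** 2))
print("without compensation:", err_plain)
print("with compensation:   ", err_comp)
```

Note that the update moves w_q exactly onto its quantized value, while the other entries of δ_F shift the remaining weights so that the output error is no larger than with plain rounding of w_q.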

Using OBQ, we can quantize the easiest weight first (the one whose rounding adds the least error) and then adjust all remaining non-quantized weights to compensate for this precision loss. Then we pick the next weight to quantize, and so on.
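The greedy procedure can be sketched as a loop over one row of weights. This is a simplified illustration under our own assumptions (Hessian 2XXᵀ, a hypothetical fixed 4-bit grid, random data), not the AutoGPTQ implementation. After each weight is quantized, its row and column are eliminated from the inverse Hessian so that later updates only touch the weights that are still free.

```python
import numpy as np

def quant(w, scale):
    """Round to a signed 4-bit grid with step `scale` (levels -8..7, assumed)."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def obq_quantize_row(w, x, scale):
    """Greedy OBQ sketch for a single weight row w with calibration inputs x."""
    w = np.asarray(w, dtype=float).copy()
    h_inv = np.linalg.inv(2.0 * x @ x.T)  # inverse Hessian of the row objective
    free = np.ones(w.size, dtype=bool)    # True = not quantized yet
    for _ in range(w.size):
        idx = np.flatnonzero(free)
        err = quant(w[idx], scale) - w[idx]
        q = idx[np.argmin(err ** 2 / h_inv[idx, idx])]  # "easiest" free weight
        d_q = h_inv[q, q]
        col, row = h_inv[:, q].copy(), h_inv[q, :].copy()
        w_q_hat = quant(w[q], scale)
        w += (w_q_hat - w[q]) / d_q * col  # compensate with the remaining weights
        w[q] = w_q_hat                     # pin the quantized weight to the grid
        h_inv -= np.outer(col, row) / d_q  # eliminate q from the inverse Hessian
        h_inv[q, :] = 0.0                  # clear floating-point residue
        h_inv[:, q] = 0.0
        free[q] = False
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=16)          # hypothetical weight row
X = rng.normal(size=(16, 128))   # hypothetical calibration inputs
scale = 0.25                     # hypothetical grid step

w_obq = obq_quantize_row(w, X, scale)
w_rtn = quant(w, scale)

def output_error(w_hat):
    return float(np.sum(((w - w_hat) @ X) ** 2))

print("round-to-nearest error:", output_error(w_rtn))
print("OBQ error:             ", output_error(w_obq))
```

Because the procedure is greedy, it usually lands on a lower output error than plain rounding, but that is not guaranteed for every input, so it is worth comparing the two numbers on your own data.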


