Recent advancements in weight quantization allow us to run large language models on consumer hardware, such as a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like GPTQ, GGML, and NF4.

In the previous article, we introduced naïve 8-bit quantization techniques and the excellent LLM.int8(). In this article, we will explore the popular GPTQ algorithm to understand how it works and implement it using the AutoGPTQ library.

Let’s start by introducing the problem we’re trying to solve. For every layer ℓ in the network, we want to find a quantized version Ŵₗ of the original weights Wₗ. This is called the layer-wise compression problem. More specifically, to minimize performance degradation, we want the outputs (ŴₗXₗ) of these new weights to be as close as possible to the original ones (WₗXₗ). In other words, we want to find:

$$\hat{W}_\ell^{*} = \arg\min_{\hat{W}_\ell} \left\lVert W_\ell X_\ell - \hat{W}_\ell X_\ell \right\rVert_2^2$$
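
To make the layer-wise objective concrete, here is a minimal NumPy sketch that measures how far the outputs of a quantized layer drift from the original ones. The helper names `quantize_rtn` and `layer_error`, the symmetric 4-bit grid, and the random `W` and `X` are illustrative assumptions, not part of GPTQ or AutoGPTQ; round-to-nearest is only used here as a simple baseline quantizer.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest onto a symmetric uniform grid (illustrative baseline)."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4 bits
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def layer_error(W, W_hat, X):
    """Squared error between original and quantized layer outputs."""
    return float(np.sum((W @ X - W_hat @ X) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))      # hypothetical layer weights (rows x columns)
X = rng.normal(size=(16, 128))    # hypothetical layer inputs (columns x samples)

W_hat = quantize_rtn(W, bits=4)
print("layer-wise error:", layer_error(W, W_hat, X))
```

The layer-wise compression problem asks for the `W_hat` on the quantization grid that makes this error as small as possible, which plain rounding generally does not achieve.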

Different approaches have been proposed to solve this problem, but we’re interested in the Optimal Brain Quantizer (OBQ) framework here.

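This method is inspired by Optimal Brain Surgeon, a pruning technique that carefully removes weights from a fully trained dense neural network. It uses an approximation technique and provides explicit formulas for the best single weight w_q to remove and the optimal update δ_F that adjusts the set F of remaining non-quantized weights to make up for the removal:

$$w_q = \arg\min_{w_q} \frac{\left(\text{quant}(w_q) - w_q\right)^2}{\left[H_F^{-1}\right]_{qq}}, \qquad \delta_F = -\frac{w_q - \text{quant}(w_q)}{\left[H_F^{-1}\right]_{qq}} \cdot \left(H_F^{-1}\right)_{:,q}$$

where quant(w) is the weight rounding given by the quantization and H_F is the Hessian restricted to the remaining weights F.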

Using OBQ, we can quantize the easiest weight first and then adjust all remaining non-quantized weights to compensate for this precision loss. Then we pick the next weight to quantize, and so on.
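
The greedy procedure described above can be sketched for a single weight row as follows. This is a simplified illustration under stated assumptions, not the reference OBQ implementation: the function names `quant` and `obq_row`, the symmetric 4-bit grid with a scale fixed up front, and the random data are all hypothetical, and the Hessian is taken to be 2XXᵀ, which is what the squared output error gives for one row.

```python
import numpy as np

def quant(w, scale, qmax):
    """Round to the nearest point of a symmetric uniform grid."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def obq_row(w, X, bits=4):
    """Greedy OBQ-style quantization of one weight row (illustrative sketch)."""
    w = np.array(w, dtype=np.float64)          # work on a copy
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax             # grid is fixed before the loop
    Hinv = np.linalg.inv(2.0 * X @ X.T)        # inverse Hessian of the row objective
    remaining = np.ones(w.size, dtype=bool)    # the set F of non-quantized weights

    for _ in range(w.size):
        idx = np.flatnonzero(remaining)
        # Cost of quantizing each remaining weight: (quant(w) - w)^2 / [H^-1]_qq
        cost = (quant(w[idx], scale, qmax) - w[idx]) ** 2 / np.diag(Hinv)[idx]
        q = idx[np.argmin(cost)]               # the "easiest" weight
        wq_hat = quant(w[q], scale, qmax)

        # Compensating update for the remaining weights
        w += -(w[q] - wq_hat) / Hinv[q, q] * Hinv[:, q]
        w[q] = wq_hat                          # pin exactly to the grid value

        # Remove q from the inverse Hessian (one step of Gaussian elimination)
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
        Hinv[q, :] = 0.0
        Hinv[:, q] = 0.0
        remaining[q] = False
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=16)            # hypothetical weight row
X = rng.normal(size=(16, 128))     # hypothetical layer inputs
w_hat = obq_row(w, X, bits=4)
print("row output error:", float(np.sum((w @ X - w_hat @ X) ** 2)))
```

Once a weight is quantized, its row and column are zeroed in the inverse Hessian, so later updates leave it untouched and only the still-unquantized weights absorb the error.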