A new solution to the GPTQ quantization algorithm

GPTQ quantizes each row in parallel, one weight at a time.
0. Introduction
The papers Optimal Brain Compression and GPTQ introduced a post-training quantization method that is one of the most popular quantization methods at the moment. The algorithm quantizes the rows of a matrix independently. In each row, one weight is quantized at a time and the remaining unquantized weights are modified to minimize the quantization error. The quantization error and the modified weights both have closed-form solutions, provided in the papers above. I have also provided a proof of these equations here.
Here I will provide a novel way to approach the problem and arrive at a different solution, one whose expression is easier to comprehend and to interpret geometrically. Finally, I will show that this solution is equivalent to the one provided in the papers above.
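Before the derivation, here is a minimal numpy sketch of the per-row loop described above, under a few simplifying assumptions: the Hessian is taken as \( H = 2XX^T \) (the convention of the GPTQ paper), the quantizer is a plain round-to-nearest with an illustrative `scale`, and the restricted inverse Hessian is recomputed at every step rather than updated efficiently as real implementations do. The function name `quantize_row` is just for this sketch.

```python
import numpy as np

def quantize_row(w, H, scale=0.1):
    """Naive GPTQ-style loop for one row: quantize one weight at a time and
    compensate the still-unquantized weights with the closed-form update.
    Illustrative sketch only (round-to-nearest quantizer, Hessian inverse
    recomputed at every step)."""
    w = w.astype(float).copy()
    n = w.shape[0]
    for j in range(n):
        # Hessian restricted to the weights that are still unquantized: j, j+1, ..., n-1.
        Hinv = np.linalg.inv(H[j:, j:])
        q = scale * np.round(w[j] / scale)   # round-to-nearest quantization
        dw0 = q - w[j]                       # quantization error of weight j
        w[j] = q
        # Closed-form compensation of the remaining weights (derived in this post).
        w[j + 1:] += dw0 / Hinv[0, 0] * Hinv[1:, 0]
    return w

# Toy example: H = 2 X X^T from random calibration inputs.
rng = np.random.default_rng(0)
n, m = 8, 64
X = rng.normal(size=(n, m))
H = 2 * X @ X.T
w = rng.normal(size=n)
w_q = quantize_row(w, H)
print("output error:", np.linalg.norm((w_q - w) @ X))
```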
1. Problem setup
Let \( W \in R^n \) be a vector of weights that needs to be quantized, and \( dW \) be a perturbation of those weights. Suppose the first weight has already been quantized, so that \( dW_0 = W_{0q} - W_0 \) is fixed, and we want to find the perturbation of the remaining weights \( dW_{1:} \) that minimizes the quantization error (the squared error of the layer output).
\[ \begin{align} f(\amber{dW}) &= \|(W + \amber{dW})X - WX\|^2 \\ &= \cancel{J dW} + \frac{1}{2} \amber{dW^T} H \amber{dW} + \cancel{\mathcal{O}(dW^3)} \\ &= \frac{1}{2} \amber{dW^T} H \amber{dW} \\ \end{align} \]
where \( dW \) is:
\( \begin{bmatrix} \left. dW_0 \ \ \ \right\}{\text{constant}} \\ \left. \amber{dW_{1:}} \ \right\}{\amber{\text{variables}}} \end{bmatrix} \)
That is, we want to find how to modify the unquantized weights, after the first weight has been quantized, so that this error is minimized.
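Since \( f \) is exactly quadratic in \( dW \), the linear and higher-order terms vanish identically. Here is a quick numerical check of the quadratic form on random toy data, assuming the convention \( H = 2XX^T \), with \( W \) one row of the weight matrix and \( X \) the calibration inputs:

```python
import numpy as np

# Check that the loss equals 0.5 * dW^T H dW exactly when H = 2 X X^T.
rng = np.random.default_rng(0)
n, m = 6, 40
X = rng.normal(size=(n, m))
W = rng.normal(size=n)
dW = rng.normal(size=n) * 1e-2

H = 2 * X @ X.T
f = np.linalg.norm((W + dW) @ X - W @ X) ** 2
quad = 0.5 * dW @ H @ dW
print(np.isclose(f, quad))   # True: the expansion is exact, no higher-order terms
```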
2. Solution
In the original proof, the already quantized weight is used as a constraint to form a Lagrangian. Here we take a different approach. The first thing we do is break the Hessian down into its Cholesky decomposition, \( H = LL^T \), which immediately gives us (ignoring the constant 0.5):
\[ \begin{align} f(\amber{dW}) &= \| L^T \amber{dW} \|^2 \\ &= \| \underbrace{L_{:,0}^T \ dW_0}_{a} + \underbrace{L_{:,1:}^T}_{M} \ \underbrace{\amber{dW_{1:}}}_{\amber{x}} \|^2 \\ g(\amber{x}) &= \| \underbrace{a}_{\in R^n} + \underbrace{M}_{\in R^{n \times (n-1)}} \ \ \underbrace{\amber{x}}_{\amber{\in R^{n-1}}} \|^2 \end{align} \]
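This reformulation is easy to verify numerically; the sketch below builds \( a \) and \( M \) from numpy's Cholesky factor (the lower-triangular \( L \) with \( H = LL^T \)) and compares both norms on random toy data:

```python
import numpy as np

# Check ||L^T dW||^2 == ||a + M x||^2 with a = (L^T)[:, 0] * dW_0, M = (L^T)[:, 1:].
rng = np.random.default_rng(1)
n, m = 6, 40
X = rng.normal(size=(n, m))
H = 2 * X @ X.T                       # positive definite Hessian
L = np.linalg.cholesky(H)             # H = L L^T, L lower triangular
dW = rng.normal(size=n)
dW0, x = dW[0], dW[1:]

a = L.T[:, 0] * dW0                   # contribution of the frozen weight
M = L.T[:, 1:]                        # columns acted on by the free weights
lhs = np.linalg.norm(L.T @ dW) ** 2
rhs = np.linalg.norm(a + M @ x) ** 2
print(np.isclose(lhs, rhs))           # True
```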
We have thus reduced the problem to finding the \( x \) that minimizes the squared norm of a sum of vectors. I wrote about this problem and its solution here. The solution to our problem is therefore given by:
\( \amber{x} = \amber{dW_{1:}} = -(M^TM)^{-1} M^T a \)
We also get an insight into what the optimal weights do geometrically:
The optimal weights \( dW_{1:} \) negate the projection of \( a \) onto the column space of \( M \).
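Since this is the ordinary least-squares solution of \( Mx \approx -a \), it can be cross-checked against `np.linalg.lstsq`, and the projection interpretation can be verified directly on toy data:

```python
import numpy as np

# x* = -(M^T M)^{-1} M^T a is the least-squares solution of M x ≈ -a,
# and M x* is exactly minus the projection of a onto the column space of M.
rng = np.random.default_rng(2)
n = 6
X = rng.normal(size=(n, 40))
H = 2 * X @ X.T
L = np.linalg.cholesky(H)
dW0 = 0.05                                            # example quantization error of weight 0
a = L.T[:, 0] * dW0
M = L.T[:, 1:]

x_star = -np.linalg.solve(M.T @ M, M.T @ a)           # -(M^T M)^{-1} M^T a
x_lstsq = np.linalg.lstsq(M, -a, rcond=None)[0]       # same thing via lstsq
proj_a = M @ np.linalg.solve(M.T @ M, M.T @ a)        # projection of a onto col(M)

print(np.allclose(x_star, x_lstsq))                   # True
print(np.allclose(M @ x_star, -proj_a))               # True: the projection is negated
```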
Next, let us express \( M \) and \( a \) in terms of \( H \) and \( dW \). Note that the i-th column of \( M \) is the (i+1)-th column of \( L^T \), or equivalently the (i+1)-th row of \( L \). In other words:
\( M = L_{:,1:}^T \Rightarrow M^T = L_{1:,:} \)
Thus, \( M^TM = L_{1:,:} (L_{1:,:})^T = (LL^T)_{1:,1:} \) is just the bottom-right \( (n-1) \times (n-1) \) block of \( H \).
The upper triangular matrix \( L^T \) and \( M \)
and the vector \( M^Ta \) is:
\[ M^T a = \begin{bmatrix} \amber{l_{10}} & l_{11} & \dots & 0 \\ \amber{l_{20}} & l_{21} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \amber{l_{n0}} & l_{n1} & \dots & l_{nn} \end{bmatrix} \begin{bmatrix} dW_0 \, l_{00} \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \underbrace{dW_0}_{(w_{0q} - w_0)} l_{00} \begin{bmatrix} \amber{l_{10}} \\ \amber{l_{20}} \\ \vdots \\ \amber{l_{n0}} \end{bmatrix} = \amber{H_{1:,0}} \ dW_0 \]
Finally, note that \( H_{00} = l_{00}^2 \), which lets us label every block of \( H \):
\( H = \begin{bmatrix} \underbrace{H_{0,0}}_{= l_{00}^2} & H_{1:,0}^T \\ \rose{\underbrace{H_{1:,0}}_{= \frac{M^Ta}{dW_0}}} & \underbrace{\amber{H_{1:,1:}}}_{\amber{= M^TM}} \end{bmatrix} \)
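All three block identities can be confirmed numerically on random toy data:

```python
import numpy as np

# Check: M^T M = H[1:, 1:],  M^T a = H[1:, 0] * dW_0,  H[0, 0] = l_00^2.
rng = np.random.default_rng(3)
n = 6
X = rng.normal(size=(n, 40))
H = 2 * X @ X.T
L = np.linalg.cholesky(H)
dW0 = 0.05
a = L.T[:, 0] * dW0
M = L.T[:, 1:]

print(np.allclose(M.T @ M, H[1:, 1:]))        # True
print(np.allclose(M.T @ a, H[1:, 0] * dW0))   # True
print(np.isclose(H[0, 0], L[0, 0] ** 2))      # True
```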
Substituting these blocks into the solution, we rewrite the final expression as:
\( dW_{1:} = -\amber{(H_{1:,1:})^{-1}} \rose{H_{1:,0}} \ (w_{0q} - w_0) \)
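A quick check, on toy data, that this closed form reproduces the least-squares solution from the Cholesky formulation:

```python
import numpy as np

# Check that -(H[1:,1:])^{-1} H[1:,0] (w0q - w0) matches the lstsq solution.
rng = np.random.default_rng(4)
n = 6
X = rng.normal(size=(n, 40))
H = 2 * X @ X.T
L = np.linalg.cholesky(H)
w0, w0q = 0.37, 0.40                      # example original and quantized first weight
dW0 = w0q - w0

a = L.T[:, 0] * dW0
M = L.T[:, 1:]
x_lstsq = np.linalg.lstsq(M, -a, rcond=None)[0]
x_closed = -np.linalg.solve(H[1:, 1:], H[1:, 0]) * dW0

print(np.allclose(x_lstsq, x_closed))     # True
```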
3. Equivalence to the original solution
The original solution is given by:
\( dW_{1:} = \frac{(w_{0q} - w_0)}{(H^{-1})_{00}} \ (H^{-1})_{1:,0}\)
Note that in this expression the matrix blocks come from \( H^{-1} \), whereas in the expression derived above they come from \( H \). To connect the two, we express \( H^{-1} \) in terms of \( H \) using the Schur complement:
\( H^{-1} = \begin{bmatrix} \underbrace{S^{-1}}_{(H^{-1})_{00}} & ... \\ & \\ \underbrace{-\amber{(H_{1:,1:})^{-1}} \rose{H_{1:,0}} \ S^{-1}}_{(H^{-1})_{1:,0}} & ... \end{bmatrix} \)
Substituting this into the expression above:
\( \frac{(H^{-1})_{1:,0}}{(H^{-1})_{00}} = -\amber{(H_{1:,1:})^{-1}} \rose{H_{1:,0}} \)
This proves the equivalence. The exact value of \( S \) does not matter here since it cancels out (assuming it is non-zero), but for the sake of completeness, \( S = H_{00} - H_{1:,0}^T (H_{1:,1:})^{-1} H_{1:,0} \) is the Schur complement of \( H_{1:,1:} \).
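The equivalence is also easy to confirm numerically, comparing the \( H^{-1} \)-based update of the original papers with the \( H \)-based form derived here on random toy data:

```python
import numpy as np

# Check: (H^{-1})[1:, 0] / (H^{-1})[0, 0] == -(H[1:, 1:])^{-1} H[1:, 0],
# so both forms of the update give the same dW_{1:}.
rng = np.random.default_rng(5)
n = 6
X = rng.normal(size=(n, 40))
H = 2 * X @ X.T
Hinv = np.linalg.inv(H)
dW0 = 0.03                                            # w0q - w0

dW_original = dW0 / Hinv[0, 0] * Hinv[1:, 0]          # H^{-1}-based (GPTQ) form
dW_new = -np.linalg.solve(H[1:, 1:], H[1:, 0]) * dW0  # H-based form derived here

print(np.allclose(dW_original, dW_new))               # True
```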
