GPTQ: Optimizing 4-Bit Weight Quantization for LLMs

Michael Brenndoerfer · January 13, 2026 · 29 min read

Discover how GPTQ optimizes weight quantization using Hessian-based error compensation to compress LLMs to 4 bits while maintaining near-FP16 accuracy.


GPTQ

In the previous chapters on INT8 and INT4 quantization, we explored straightforward approaches to weight compression: round each weight to the nearest representable value in the target format. While these methods work reasonably well, they treat each weight independently, ignoring the complex interactions between weights in a neural network. A small rounding error in one weight might be catastrophic for model quality, while a larger error in another weight might barely matter.

GPTQ (GPT Quantization) takes a fundamentally different approach. Rather than treating quantization as a simple rounding problem, GPTQ frames it as an optimization problem: given a layer's weights, find the quantized values that minimize the layer's output error. The key insight is that when you quantize one weight, you can partially compensate for the resulting error by adjusting the remaining weights before they too are quantized.

This compensation mechanism, combined with algorithmic optimizations, allows GPTQ to achieve low quantization error. Models quantized with GPTQ to 4 bits often perform nearly as well as their full-precision counterparts, enabling LLMs with tens of billions of parameters to run on consumer GPUs. GPTQ was one of the first methods to make models like LLaMA-65B practically usable on hardware that would otherwise be unable to load them.

The Layer-Wise Reconstruction Objective

To understand how GPTQ approaches quantization, we must first establish what it means to quantize well. The fundamental question is: what objective should we optimize? One natural answer might be to minimize the difference between original and quantized weights directly. However, this ignores a crucial insight: not all weight errors are equally harmful. What truly matters is how quantization affects the layer's output, because the output is what downstream computations depend upon.

GPTQ operates on one layer at a time, treating each layer's quantization as an independent optimization problem. This layer-wise decomposition is both a practical necessity and a reasonable approximation. Consider a linear layer with weights \mathbf{W} \in \mathbb{R}^{d_{out} \times d_{in}} and inputs \mathbf{X} \in \mathbb{R}^{d_{in} \times n}, where n is the number of calibration tokens we use to estimate the layer's behavior. The goal is to find quantized weights \hat{\mathbf{W}} that minimize the reconstruction error:

\mathcal{L}(\hat{\mathbf{W}}) = ||\mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X}||_F^2

where:

  • \mathcal{L}(\hat{\mathbf{W}}): the reconstruction error (loss) for the quantized weights
  • \mathbf{W}: the original high-precision weight matrix
  • \hat{\mathbf{W}}: the quantized weight matrix
  • \mathbf{X}: the input matrix (calibration data)
  • ||\cdot||_F: the Frobenius norm (square root of the sum of squared elements)

This loss function measures how much the layer's output changes due to quantization. Notice that we are comparing outputs, not weights directly. This distinction is essential: a weight that interacts with large input values will have a greater impact on the output than a weight that sees only small inputs. The Frobenius norm provides a natural way to aggregate these output differences into a single scalar that we can minimize.
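
To make the objective concrete, here is a minimal sketch (made-up dimensions, with simple rounding standing in for a real quantizer) that computes the output reconstruction error and contrasts it with the plain weight error:

import torch

torch.manual_seed(0)
d_out, d_in, n = 8, 16, 64

W = torch.randn(d_out, d_in)        # original weights
X = torch.randn(d_in, n)            # calibration inputs, one column per token
W_hat = torch.round(W * 4) / 4      # crude stand-in for a quantized copy of W

# What GPTQ minimizes: the change in the layer's *output*
output_error = ((W @ X - W_hat @ X) ** 2).sum()

# What naive rounding implicitly minimizes: the change in the weights themselves
weight_error = ((W - W_hat) ** 2).sum()

print(output_error.item(), weight_error.item())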

Why focus on layer outputs rather than the final model loss? Computing gradients through the entire model for every quantization decision would be prohibitively expensive. Each quantization decision would require a full forward and backward pass, and with millions of weights to quantize, this approach simply does not scale. The layer-wise approach makes GPTQ tractable: we only need to run a forward pass through the model once to collect each layer's inputs, then we can quantize each layer independently. This factorization reduces a global optimization problem into many smaller, manageable subproblems.

The reconstruction error can be rewritten in a form that reveals important mathematical structure. This reformulation exposes the role of input correlations in determining which weights matter most. Expanding the Frobenius norm using the identity ||\mathbf{A}||_F^2 = \text{Tr}(\mathbf{A}\mathbf{A}^T):

\begin{aligned} \mathcal{L}(\hat{\mathbf{W}}) &= ||(\mathbf{W} - \hat{\mathbf{W}})\mathbf{X}||_F^2 \\ &= \text{Tr}\left[ ((\mathbf{W} - \hat{\mathbf{W}})\mathbf{X}) ((\mathbf{W} - \hat{\mathbf{W}})\mathbf{X})^T \right] && \text{(apply identity: } ||\mathbf{A}||_F^2 = \text{Tr}(\mathbf{A}\mathbf{A}^T)\text{)} \\ &= \text{Tr}\left[(\mathbf{W} - \hat{\mathbf{W}})\mathbf{X}\mathbf{X}^T(\mathbf{W} - \hat{\mathbf{W}})^T\right] && \text{(expand transpose: } (\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\text{)} \end{aligned}

where:

  • \text{Tr}[\cdot]: the trace operator (sum of diagonal elements)
  • (\mathbf{W} - \hat{\mathbf{W}}): the weight error matrix
  • \mathbf{X}^T: the transpose of the input matrix

The trace operation sums the diagonal elements of the resulting matrix. What emerges from this manipulation is that the loss depends on the weight errors not in isolation, but as weighted by the matrix \mathbf{X}\mathbf{X}^T. This matrix captures how the inputs correlate with each other: if two input dimensions tend to activate together, errors in their corresponding weights will interact.

Define \mathbf{H} = 2\mathbf{X}\mathbf{X}^T, which we call the Hessian matrix (we'll explain why this name is appropriate shortly). The loss becomes:

\mathcal{L}(\hat{\mathbf{W}}) = \frac{1}{2}\text{Tr}\left[(\mathbf{W} - \hat{\mathbf{W}})\mathbf{H}(\mathbf{W} - \hat{\mathbf{W}})^T\right]

where:

  • \text{Tr}[\cdot]: the trace operator
  • (\mathbf{W} - \hat{\mathbf{W}}): the weight error matrix
  • \mathbf{H}: the Hessian matrix (2\mathbf{X}\mathbf{X}^T) that captures input correlations
  • \frac{1}{2}: a scaling factor arising from the definition of \mathbf{H}

This quadratic form indicates that the loss is a weighted sum of squared errors, where the weighting comes from the Hessian matrix. Errors in weights that correspond to highly active or correlated inputs are penalized more heavily than errors in weights that rarely contribute to the output.

Since the rows of \mathbf{W} don't interact in this expression (each output dimension is computed independently as a dot product with the inputs), we can optimize each row separately. This further simplifies our problem: instead of optimizing over all d_{out} \times d_{in} weights simultaneously, we can solve d_{out} independent problems, each involving only d_{in} weights. For a single row \mathbf{w} \in \mathbb{R}^{d_{in}}:

\mathcal{L}(\hat{\mathbf{w}}) = \frac{1}{2}(\mathbf{w} - \hat{\mathbf{w}})\mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})^T

where:

  • \mathbf{w}: a single row of the original weight matrix
  • \hat{\mathbf{w}}: the corresponding row of quantized weights
  • \mathbf{H}: the Hessian matrix (shared across all rows)
  • (\mathbf{w} - \hat{\mathbf{w}}): the error vector for this row

This weighted squared error gives us the foundation for GPTQ's optimization strategy. The quadratic structure of this loss function is crucial because it enables closed-form solutions for optimal weight updates, as we will see in the following sections.
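
As a quick numerical check (a small sketch with arbitrary sizes), the per-row quadratic form reproduces exactly the output error computed directly from the inputs:

import torch

torch.manual_seed(0)
d_in, n = 16, 64
X = torch.randn(d_in, n, dtype=torch.float64)
w = torch.randn(d_in, dtype=torch.float64)
w_hat = torch.round(w * 4) / 4      # stand-in for a quantized row

H = 2 * X @ X.T                     # Hessian built from input statistics
e = w - w_hat                       # error vector for this row

quadratic_form = 0.5 * e @ H @ e    # (1/2)(w - w_hat) H (w - w_hat)^T
direct_error = ((w @ X - w_hat @ X) ** 2).sum()

print(torch.allclose(quadratic_form, direct_error))  # True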

The Hessian and Its Role

The matrix \mathbf{H} = 2\mathbf{X}\mathbf{X}^T is called the Hessian because it equals the second derivative of the reconstruction loss with respect to the weights. This naming reflects a connection to optimization theory. The Hessian matrix characterizes the local curvature of the loss landscape, telling us how rapidly the loss changes as we move in different directions through weight space.

To see why this matrix represents second derivatives, consider the loss for a single output neuron. We want to find weights that make the neuron's output match what it would produce with the original weights:

\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_i - y_i)^2

where:

  • \mathbf{w}: the weight vector being optimized
  • \mathbf{x}_i: the i-th input vector from the calibration set
  • y_i: the target output scalar (computed using the original weights \mathbf{w}^*)
  • n: the total number of calibration tokens

This is a standard least-squares objective. To find its minimum, we compute the gradient by differentiating with respect to each weight component. The gradient is:

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_i - y_i)\mathbf{x}_i

where:

  • \frac{\partial \mathcal{L}}{\partial \mathbf{w}}: the gradient vector of the loss with respect to the weights
  • \mathbf{w}: the weight vector
  • \mathbf{x}_i: the i-th input vector
  • y_i: the target output scalar

Taking the derivative once more, we obtain the Hessian, which tells us how the gradient itself changes as we modify the weights:

\mathbf{H} = \frac{\partial^2 \mathcal{L}}{\partial \mathbf{w}^2} = 2\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T = 2\mathbf{X}\mathbf{X}^T

where:

  • \mathbf{H}: the Hessian matrix
  • \frac{\partial^2 \mathcal{L}}{\partial \mathbf{w}^2}: the second derivative of the loss function
  • \mathbf{x}_i: the i-th input vector
  • \mathbf{X}: the matrix of all calibration inputs

The Hessian captures the curvature of the loss landscape around the current weights. Each element of this matrix has a specific interpretation that guides the quantization process. The diagonal elements H_{qq} indicate how sensitive the loss is to changes in weight w_q. A large value means that small perturbations to this weight cause large changes in the output, making this weight "important" in the sense that we must quantize it carefully. The off-diagonal elements H_{qj} capture interactions between weights: they tell us how changes in one weight affect the optimal value of another. When two weights have a large off-diagonal element, their quantization errors are not independent; an error in one can be partially offset by adjusting the other.
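
A tiny two-weight example (arbitrary numbers, purely for intuition) shows the off-diagonal effect: when two inputs are strongly correlated, an error in one weight can be largely absorbed by nudging the other.

import torch

torch.manual_seed(0)
n = 10_000
x1 = torch.randn(n)
x2 = 0.9 * x1 + 0.1 * torch.randn(n)   # second input tracks the first
X = torch.stack([x1, x2])              # shape (2, n)

H = 2 * X @ X.T                        # large off-diagonal entry from the correlation
w = torch.tensor([1.0, 1.0])

def loss(w_hat):
    e = w - w_hat
    return 0.5 * e @ H @ e

print(loss(torch.tensor([1.1, 1.0])).item())  # error in weight 0, no compensation
print(loss(torch.tensor([1.1, 0.9])).item())  # crude compensation via weight 1: far smaller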

Hessian as Input Statistics

The Hessian \mathbf{H} = 2\mathbf{X}\mathbf{X}^T is simply a scaled second-moment matrix of the layer's inputs (an uncentered covariance). We don't need access to labels or backpropagation; we just need to observe what inputs flow through the layer during a forward pass on calibration data.

This observation has significant practical implications. Computing the Hessian requires only a forward pass through the network on calibration data. We never need to differentiate through the model or compute any target labels. The Hessian emerges naturally from the statistics of the inputs that the layer observes during normal operation. This makes GPTQ a true post-training quantization method: we take a pre-trained model, run it on some representative data, and use the resulting statistics to guide quantization.
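
The sketch below (a toy linear layer, not AutoGPTQ's internal code) shows how the Hessian can be accumulated with nothing more than a forward hook while calibration batches flow through the layer:

import torch
import torch.nn as nn

def attach_hessian_hook(linear: nn.Linear):
    """Accumulate H = 2 * sum_i x_i x_i^T from whatever inputs the layer sees."""
    d_in = linear.in_features
    H = torch.zeros(d_in, d_in)

    def hook(module, inputs, output):
        x = inputs[0].reshape(-1, d_in).float()  # (n_tokens, d_in)
        H.add_(2.0 * x.T @ x)

    handle = linear.register_forward_hook(hook)
    return H, handle

# Toy usage: forward passes only, no labels and no backpropagation
layer = nn.Linear(32, 64)
H, handle = attach_hessian_hook(layer)
for _ in range(8):
    layer(torch.randn(4, 128, 32))   # pretend calibration batches
handle.remove()
print(H.shape)                       # torch.Size([32, 32])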

Optimal Brain Quantization

GPTQ builds on a framework called Optimal Brain Quantization (OBQ), which itself descends from classical work on neural network pruning. The historical connection is illuminating: pruning and quantization are closely related problems. In pruning, we set certain weights exactly to zero. In quantization, we round weights to the nearest value in a discrete set. Both operations introduce error, and in both cases we want to minimize the impact on the network's output.

The core insight of OBQ is that when you quantize one weight, you can compute the optimal adjustment to all remaining weights that minimizes the resulting increase in loss. This is not a heuristic or approximation; given the quadratic structure of our loss function, there exists a closed-form formula for the best possible compensation.

Suppose we've decided to quantize weight w_q to the value \hat{w}_q. We want to adjust the remaining weights \mathbf{w}_F (where F denotes the set of weights not yet quantized) to minimize:

\mathcal{L}(\hat{w}_q, \mathbf{w}_F + \delta\mathbf{w}_F) = \frac{1}{2}(\mathbf{w} - \hat{\mathbf{w}})\mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})^T

where:

  • \mathbf{w}: the original weight vector
  • \hat{\mathbf{w}}: the quantized weight vector (incorporating both the quantized weight \hat{w}_q and the adjustment \delta\mathbf{w}_F)
  • \mathbf{H}: the Hessian matrix
  • \delta\mathbf{w}_F: the optimal adjustment vector we want to find for the remaining weights
  • \hat{w}_q: the quantized value for weight q

The optimization proceeds as follows. We have committed to a particular quantized value \hat{w}_q for weight q, which introduces an error w_q - \hat{w}_q. The question is: how should we modify the remaining weights to absorb as much of this error as possible? Because our loss is quadratic, we can differentiate with respect to \delta\mathbf{w}_F, set the result to zero, and solve for the optimal adjustment.

Taking the derivative and setting it to zero yields the optimal update:

\delta\mathbf{w}_F = -\frac{w_q - \hat{w}_q}{[\mathbf{H}^{-1}]_{qq}} \cdot [\mathbf{H}^{-1}]_{F,q}

where:

  • \delta\mathbf{w}_F: the optimal adjustment vector for the remaining unquantized weights
  • w_q - \hat{w}_q: the quantization error for the current weight q
  • [\mathbf{H}^{-1}]_{qq}: the diagonal element of the inverse Hessian corresponding to weight q
  • [\mathbf{H}^{-1}]_{F,q}: the elements of the q-th column of the inverse Hessian corresponding to the remaining weights F

Let's unpack this formula to build intuition for what it means. The numerator w_q - \hat{w}_q is the quantization error for weight q, simply the difference between what we wanted and what we got. The denominator [\mathbf{H}^{-1}]_{qq} is the inverse curvature for weight q: when it is large, the loss is relatively insensitive to errors in this weight, so the penalty for quantizing it is small and compensation is easy. The vector [\mathbf{H}^{-1}]_{F,q} determines how this error should be distributed across the remaining weights based on their correlations with weight q. Weights that are strongly correlated with w_q (in terms of their effect on the output) receive larger adjustments.

The resulting increase in loss from quantizing weight q (after optimal compensation) is:

\Delta\mathcal{L}_q = \frac{(w_q - \hat{w}_q)^2}{2[\mathbf{H}^{-1}]_{qq}}

where:

  • \Delta\mathcal{L}_q: the increase in reconstruction error caused by quantizing weight q
  • (w_q - \hat{w}_q)^2: the squared quantization error
  • [\mathbf{H}^{-1}]_{qq}: the diagonal element of the inverse Hessian (the inverse curvature for weight q)

This formula tells us exactly how much each quantization decision costs, accounting for optimal compensation. The cost depends on two factors: the squared quantization error and the inverse curvature. Weights where [\mathbf{H}^{-1}]_{qq} is large can be quantized with relatively low cost even if the rounding error is substantial. Conversely, weights with small [\mathbf{H}^{-1}]_{qq} values are stiff: even small errors cause significant loss increases.
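
The sketch below (random data, one weight, crude rounding as the quantizer) checks both formulas numerically: applying the compensation reduces the loss, and the loss that remains matches the closed-form cost:

import torch

torch.manual_seed(0)
d_in, n = 8, 256
X = torch.randn(d_in, n, dtype=torch.float64)
w = torch.randn(d_in, dtype=torch.float64)

H = 2 * X @ X.T
H_inv = torch.linalg.inv(H)

def loss(w_hat):
    e = w - w_hat
    return 0.5 * e @ H @ e

q = 0                                   # quantize the first weight
w_hat = w.clone()
w_hat[q] = torch.round(w[q] * 10) / 10  # crude stand-in for quantization
err = w[q] - w_hat[q]

w_comp = w_hat.clone()
w_comp[q + 1:] -= err / H_inv[q, q] * H_inv[q + 1:, q]  # optimal compensation

print(loss(w_hat).item())                     # loss without compensation
print(loss(w_comp).item())                    # smaller loss with compensation
print((err**2 / (2 * H_inv[q, q])).item())    # closed-form cost: matches the line above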

The GPTQ Algorithm

The naive OBQ approach would quantize weights one at a time, choosing at each step the weight whose quantization causes the smallest loss increase. While this strategy appears straightforward, its computational cost is prohibitive. Each step requires examining all remaining weights to find the best candidate, and after quantizing each weight, we must update the inverse Hessian. This requires O(d_{in}^2) operations per weight, yielding O(d_{in}^3) complexity per row and roughly O(d_{in}^4) for the entire layer when d_{out} is comparable to d_{in}. That's far too slow for large models where d_{in} can be in the thousands.

GPTQ makes three key modifications that reduce the total complexity to roughly O(d_{in}^3) while maintaining nearly identical accuracy. The insight is that the optimal quantization order matters less than one might expect. What matters more is performing the compensation correctly.

Column-wise processing: Instead of quantizing weights one at a time in optimal order, GPTQ processes all weights in a fixed column order. Alternatively, it uses a smart ordering based on Hessian diagonals, called ActOrder. Processing columns together enables vectorization, allowing us to quantize all rows of the weight matrix simultaneously for a given column.

Lazy batch updates: Rather than updating all remaining weights after each quantization, GPTQ accumulates updates in blocks and applies them periodically. This improves cache efficiency dramatically because memory access patterns become more predictable and localized.

Cholesky-based inverse updates: Computing and updating the full inverse Hessian is expensive. GPTQ uses the Cholesky decomposition of the inverse Hessian, which can be updated efficiently as columns are processed. The triangular structure of the Cholesky factor allows updates in O(d_{in}) time per column after an initial O(d_{in}^3) factorization.

Here's the algorithm in detail:

  1. Collect calibration data: Run a small set of examples through the model, recording each layer's inputs \mathbf{X}.

  2. Compute the Hessian: For each layer, compute \mathbf{H} = 2\mathbf{X}\mathbf{X}^T and its inverse (or Cholesky factor).

  3. Process each row of the weight matrix: For each row \mathbf{w}:

    a. For each column q from 0 to d_{in} - 1:

    • Quantize: \hat{w}_q = \text{quantize}(w_q)
    • Compute error: \delta_q = w_q - \hat{w}_q
    • Update remaining weights: w_{q+1:} \leftarrow w_{q+1:} - \frac{\delta_q}{[\mathbf{H}^{-1}]_{qq}} \cdot [\mathbf{H}^{-1}]_{q+1:,q}

    b. Store the quantized row \hat{\mathbf{w}}

  4. Replace the layer's weights with their quantized values plus any required metadata (scales, zero points).

The order in which columns are processed affects accuracy. The default left-to-right order works reasonably well, but ActOrder (processing columns in decreasing order of Hessian diagonal) often improves results by quantizing the most important weights first, when there are still many remaining weights to absorb compensation.

Mathematical Details of the Update

The inverse Hessian update is a critical mathematical component that enables efficient processing. After quantizing column q, we need to update the inverse Hessian to reflect that column q is no longer a free variable. The weight at column q has been fixed to its quantized value; it can no longer be adjusted to compensate for future quantization errors.

Let \mathbf{H}_F^{-1} be the inverse Hessian restricted to the remaining (unquantized) weights. After removing column/row q, the new inverse is:

[\mathbf{H}_{F \setminus q}^{-1}]_{ij} = [\mathbf{H}_F^{-1}]_{ij} - \frac{[\mathbf{H}_F^{-1}]_{iq} [\mathbf{H}_F^{-1}]_{qj}}{[\mathbf{H}_F^{-1}]_{qq}}

where:

  • [\mathbf{H}_{F \setminus q}^{-1}]_{ij}: the updated inverse Hessian element at row i, column j
  • [\mathbf{H}_F^{-1}]_{ij}: the current inverse Hessian element at row i, column j
  • [\mathbf{H}_F^{-1}]_{iq}, [\mathbf{H}_F^{-1}]_{qj}: elements from the q-th column/row of the current inverse Hessian
  • [\mathbf{H}_F^{-1}]_{qq}: the diagonal element corresponding to weight q
  • i, j: indices of the remaining unquantized weights

This formula is a rank-one update that removes the row and column corresponding to q. The intuition is that by fixing weight q, we eliminate one degree of freedom from our optimization problem. The relationships between remaining weights must be adjusted to account for this lost flexibility. Computing this update naively would cost O(d^2) per column, but the Cholesky decomposition enables a much more efficient approach.

The Cholesky factorization gives us \mathbf{H}^{-1} = \mathbf{L}\mathbf{L}^T where \mathbf{L} is lower triangular. This factorization is particularly valuable because the triangular structure directly encodes the sequential dependencies between weights. The q-th column of \mathbf{L} directly gives us the coefficients needed for the weight update, and removing column q corresponds to simply moving to the next column of \mathbf{L}. This reduces the per-column update cost to O(d) after the initial O(d^3) factorization.
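
As a sanity check (a small sketch with random double-precision data), the rank-one downdate reproduces the inverse of the reduced Hessian exactly, and the Cholesky factor of \mathbf{H}^{-1} is what GPTQ precomputes in place of these explicit downdates:

import torch

torch.manual_seed(0)
d = 6
X = torch.randn(d, 128, dtype=torch.float64)
H = 2 * X @ X.T
H_inv = torch.linalg.inv(H)

q = 0  # remove the first row/column after "fixing" weight 0

# Rank-one downdate of the inverse Hessian ...
downdated = H_inv[1:, 1:] - torch.outer(H_inv[1:, q], H_inv[q, 1:]) / H_inv[q, q]

# ... matches directly inverting the Hessian with row/column q removed
direct = torch.linalg.inv(H[1:, 1:])
print(torch.allclose(downdated, direct))  # True

# The Cholesky factor of H^{-1} packages this sequence of downdates
L = torch.linalg.cholesky(H_inv)          # H^{-1} = L @ L.T, L lower triangular
print(torch.allclose(L @ L.T, H_inv))     # True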

Calibration Data

GPTQ requires calibration data to estimate the Hessian. The quality and quantity of this data affects quantization results:

  • Quantity: GPTQ typically uses 128-1024 samples. More samples give a better Hessian estimate but increase computation.

  • Representativeness: The calibration data should resemble the data the model will see at inference. Using random text to calibrate a code model yields worse results than using code samples.

  • Sequence length: Longer sequences provide more tokens per sample. Most implementations use 2048 tokens per sequence.

The calibration data doesn't need labels; GPTQ only performs forward passes to collect layer activations. Common choices include random samples from C4 (web text), WikiText (Wikipedia), or domain-specific data for specialized models.
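
A typical preparation step looks like the following sketch (WikiText-2 and OPT's tokenizer are just illustrative choices): concatenate raw text, tokenize it once, and slice it into fixed-length calibration sequences.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Concatenate the raw text and tokenize it in one pass
# (the tokenizer's "sequence too long" warning is expected and harmless here)
text = "\n\n".join(dataset["text"])
tokens = tokenizer(text, return_tensors="pt").input_ids

n_samples, seq_len = 128, 2048
calibration = [tokens[:, i * seq_len:(i + 1) * seq_len] for i in range(n_samples)]
print(len(calibration), calibration[0].shape)  # 128 sequences of 2048 tokens each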

Group Quantization

Standard per-tensor or per-channel quantization uses a single scale and zero point for many weights. GPTQ often employs group quantization, where weights are divided into groups (commonly 128 weights each) that share quantization parameters.

Group quantization provides a middle ground between accuracy and overhead:

  • Per-tensor quantization: One scale for all weights. Minimum storage overhead but potentially high quantization error if weight distributions vary.

  • Per-group quantization: Separate scales for groups of 128 weights. Small overhead (one FP16 scale per 128 INT4 weights = 0.125 bits/weight) with substantially better accuracy.

  • Per-weight quantization: Individual scales per weight. Maximum accuracy but impractical overhead.

With group size 128 and 4-bit weights, the effective storage is approximately 4.125 bits per weight, a tiny overhead for significant accuracy gains.
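
The arithmetic behind that figure is simple (this counts only the per-group FP16 scale; a packed zero point adds a little more in real implementations):

bits, group_size, scale_bits = 4, 128, 16  # INT4 weights, one FP16 scale per group

per_weight = bits + scale_bits / group_size
print(f"{per_weight:.3f} bits per weight")  # 4.125

# Rough weight-only storage for a 7B-parameter model
n_params = 7e9
print(f"FP16: {n_params * 16 / 8 / 1e9:.1f} GB")                       # 14.0 GB
print(f"GPTQ 4-bit, g=128: {n_params * per_weight / 8 / 1e9:.1f} GB")  # ~3.6 GB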

Implementation with AutoGPTQ

Let's see GPTQ in practice using the AutoGPTQ library, which provides an efficient implementation of the algorithm.


First, we'll load a small model to demonstrate the quantization process:

In[4]:
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small model for demonstration
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cpu"
)

Let's examine the original model's memory footprint:

In[5]:
Code
def count_parameters(model):
    """Count total parameters and calculate memory in FP16."""
    total_params = sum(p.numel() for p in model.parameters())
    memory_fp16_mb = total_params * 2 / (1024**2)  # 2 bytes per FP16
    return total_params, memory_fp16_mb


total_params, memory_mb = count_parameters(model)
int4_memory_mb = memory_mb / 4
Out[6]:
Console
Total parameters: 125,239,296
FP16 memory: 238.88 MB
Estimated INT4 memory: 59.72 MB

The output shows that 4-bit quantization reduces the memory footprint to roughly a quarter of its FP16 size. For this 125M-parameter model, that means dropping from about 239 MB to about 60 MB, an efficiency gain that becomes crucial when deploying much larger models.

Now let's implement a simplified version of the GPTQ core algorithm to understand how it works:

In[7]:
Code
def compute_hessian(inputs: torch.Tensor) -> torch.Tensor:
    """
    Compute the Hessian matrix H = 2 * X @ X^T for GPTQ.

    Args:
        inputs: Layer inputs of shape (n_samples, seq_len, hidden_dim)
                or (n_tokens, hidden_dim)
    Returns:
        Hessian matrix of shape (hidden_dim, hidden_dim)
    """
    # Flatten to (n_tokens, hidden_dim)
    if inputs.dim() == 3:
        inputs = inputs.reshape(-1, inputs.shape[-1])

    # H = 2 * X^T @ X (note: we transpose so each row is a feature)
    H = 2 * inputs.T @ inputs

    # Add small diagonal for numerical stability
    H += 1e-4 * torch.eye(H.shape[0], device=H.device, dtype=H.dtype)

    return H
In[8]:
Code
def quantize_weight(
    weight: float, scale: float, zero_point: int, n_bits: int = 4
) -> int:
    """Quantize a single weight value to n-bit integer."""
    qmin, qmax = 0, (2**n_bits) - 1
    q = round(weight / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize_weight(q: int, scale: float, zero_point: int) -> float:
    """Convert quantized integer back to float."""
    return scale * (q - zero_point)

Here's the core GPTQ algorithm for a single row of weights:

In[9]:
Code
def gptq_quantize_row(
    weights: torch.Tensor,  # Shape: (d_in,)
    H_inv: torch.Tensor,  # Inverse Hessian, shape: (d_in, d_in)
    n_bits: int = 4,
    group_size: int = 128,
) -> tuple[torch.Tensor, list]:
    """
    Apply GPTQ quantization to a single row of weights.

    Returns:
        Tuple of (quantized_weights, quantization_params)
    """
    d_in = weights.shape[0]
    weights = weights.clone().float()
    quantized = torch.zeros_like(weights, dtype=torch.int8)
    params = []  # Store (scale, zero_point) for each group

    # Process columns left to right
    for col in range(d_in):
        # Determine group for this column
        group_idx = col // group_size
        group_start = group_idx * group_size
        group_end = min(group_start + group_size, d_in)

        # Compute scale and zero point for this group (if at group boundary)
        if col == group_start:
            group_weights = weights[group_start:group_end]
            w_min, w_max = (
                group_weights.min().item(),
                group_weights.max().item(),
            )

            # Asymmetric quantization
            qmin, qmax = 0, (2**n_bits) - 1
            scale = (w_max - w_min) / (qmax - qmin) if w_max > w_min else 1.0
            zero_point = round(-w_min / scale) if scale > 0 else 0
            zero_point = max(qmin, min(qmax, zero_point))
            params.append((scale, zero_point))

        scale, zero_point = params[group_idx]

        # Quantize current weight
        w = weights[col].item()
        q = quantize_weight(w, scale, zero_point, n_bits)
        quantized[col] = q

        # Compute quantization error
        w_hat = dequantize_weight(q, scale, zero_point)
        error = w - w_hat

        # Update remaining weights to compensate for error
        if col < d_in - 1 and H_inv[col, col] > 1e-10:
            # Optimal update: δw_F = -error / H_inv[q,q] * H_inv[F,q]
            update = -error / H_inv[col, col] * H_inv[col + 1 :, col]
            weights[col + 1 :] += update

    return quantized, params

Let's demonstrate this on a sample weight matrix:

In[10]:
Code
# Create a sample weight matrix and random inputs
torch.manual_seed(42)
d_out, d_in = 128, 256
sample_weights = torch.randn(d_out, d_in) * 0.1
sample_inputs = torch.randn(512, d_in)  # 512 calibration tokens

# Compute Hessian and its inverse
H = compute_hessian(sample_inputs)
H_inv = torch.linalg.inv(H)
In[11]:
Code
# Quantize one row as demonstration
original_row = sample_weights[0]
quantized_row, quant_params = gptq_quantize_row(original_row, H_inv)

# Dequantize to compare
dequantized_row = torch.zeros_like(original_row)
group_size = 128
for col in range(d_in):
    group_idx = col // group_size
    scale, zero_point = quant_params[group_idx]
    dequantized_row[col] = dequantize_weight(
        quantized_row[col].item(), scale, zero_point
    )

# Compute reconstruction error
mse = ((original_row - dequantized_row) ** 2).mean().item()

# Compare with naive round-to-nearest
naive_scale = (original_row.max() - original_row.min()) / 15
naive_zp = round(-original_row.min().item() / naive_scale.item())
naive_q = torch.round(original_row / naive_scale + naive_zp).clamp(0, 15)
naive_deq = naive_scale * (naive_q - naive_zp)
naive_mse = ((original_row - naive_deq) ** 2).mean().item()
improvement_factor = naive_mse / mse
Out[12]:
Console
GPTQ quantization MSE: 0.000135
Naive quantization MSE: 0.000086
GPTQ improvement: 0.64x lower error

Note that GPTQ's weight-space MSE is actually higher than naive round-to-nearest here. This is expected: the compensation updates deliberately move the remaining weights away from their original values in order to preserve the layer's output. Weight fidelity is not what GPTQ optimizes; the output reconstruction error is, and that is where the benefit shows up, as the next section demonstrates. This error compensation is what distinguishes GPTQ from naive round-to-nearest approaches.

Out[13]:
Visualization
Distribution of quantization errors for Naive vs GPTQ methods showing that GPTQ achieves a tighter, more centered error distribution.

Measuring Output Reconstruction Error

The true measure of quantization quality is the layer output error, not just the weight error. Let's verify that GPTQ minimizes output reconstruction error:

In[14]:
Code
def measure_output_error(original_weights, quantized_weights, inputs):
    """Compute the output reconstruction error ||W @ X - W_hat @ X||."""
    original_out = inputs @ original_weights.T
    quantized_out = inputs @ quantized_weights.T
    mse = ((original_out - quantized_out) ** 2).mean().item()
    return mse
In[15]:
Code
# Quantize all rows of the sample weight matrix
quantized_weights = torch.zeros_like(sample_weights)

for i in range(d_out):
    q_row, params = gptq_quantize_row(sample_weights[i], H_inv)
    # Dequantize for error measurement
    for col in range(d_in):
        group_idx = col // 128
        scale, zero_point = params[group_idx]
        quantized_weights[i, col] = dequantize_weight(
            q_row[col].item(), scale, zero_point
        )
In[16]:
Code
# Measure output reconstruction error
gptq_output_error = measure_output_error(
    sample_weights, quantized_weights, sample_inputs
)

# Compare with naive quantization
naive_weights = torch.zeros_like(sample_weights)
for i in range(d_out):
    row = sample_weights[i]
    scale = (row.max() - row.min()) / 15
    zp = round(-row.min().item() / scale.item()) if scale > 0 else 0
    q = torch.round(row / scale + zp).clamp(0, 15)
    naive_weights[i] = scale * (q - zp)

naive_output_error = measure_output_error(
    sample_weights, naive_weights, sample_inputs
)
error_ratio = naive_output_error / gptq_output_error
Out[17]:
Console
GPTQ output reconstruction MSE: 0.022085
Naive output reconstruction MSE: 0.030633
GPTQ achieves 1.39x lower output error

In output space, GPTQ comes out clearly ahead, with roughly 1.4x lower reconstruction error than naive rounding, even though its weight-space error was higher. This is exactly the trade-off we want, since it is the outputs, not the weights, that drive the model's predictions.

Out[18]:
Visualization
Reconstruction of layer outputs across the first 50 dimensions for Original, Naive, and GPTQ weights. The GPTQ output (green dashed line) closely tracks the Original signal (solid black), while Naive quantization (red dashed line) exhibits significant deviations, illustrating the superior fidelity of error compensation.

Using AutoGPTQ in Practice

For production use, the AutoGPTQ library provides an optimized implementation with GPU acceleration. Here's how you would use it:

In[29]:
Code
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load the tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare calibration data: a list of tokenized examples
calibration_data = load_dataset("c4", "en", split="train", streaming=True)
calibration_samples = []
for sample in calibration_data.take(128):
    tokenized = tokenizer(
        sample["text"], return_tensors="pt", truncation=True, max_length=2048
    )
    calibration_samples.append(
        {
            "input_ids": tokenized["input_ids"],
            "attention_mask": tokenized["attention_mask"],
        }
    )

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,  # 4-bit quantization
    group_size=128,  # Group size for scales
    desc_act=True,  # Enable ActOrder (activation ordering)
    damp_percent=0.01,  # Dampening for numerical stability
)

# Load the FP16 model through AutoGPTQ and quantize it on the calibration data
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(calibration_samples)

# Save the quantized model
model.save_quantized("llama-2-7b-gptq-4bit")

Key Parameters

The key parameters for AutoGPTQ are:

  • bits: Target bit width, typically 4 or 8
  • group_size: Number of weights sharing quantization parameters (128 is common)
  • desc_act: Whether to use activation ordering (ActOrder), which processes columns by importance
  • damp_percent: Dampening factor added to Hessian diagonal for stability

ActOrder: Activation-Based Column Ordering

The order in which GPTQ processes columns affects the final quantization error. The default left-to-right order treats all weights equally, but some weights are more important than others.

ActOrder (activation ordering) processes columns in decreasing order of their Hessian diagonal values H_{qq}. Weights with larger H_{qq} have a greater impact on the output and are quantized first, while many unquantized weights are still available to absorb the compensation updates.

In[19]:
Code
def get_actorder_permutation(H: torch.Tensor) -> torch.Tensor:
    """
    Compute column ordering based on Hessian diagonal.

    Returns permutation that processes largest diagonal elements first.
    """
    diag = torch.diag(H)
    perm = torch.argsort(diag, descending=True)
    return perm
In[20]:
Code
# Demonstrate ActOrder on our sample Hessian
perm = get_actorder_permutation(H)
first_10 = perm[:10].tolist()
last_10 = perm[-10:].tolist()

# Compare Hessian diagonal values
diag_important = H[perm[0], perm[0]]
diag_least = H[perm[-1], perm[-1]]
Out[21]:
Console
First 10 columns to process (most important): [38, 191, 192, 22, 200, 179, 198, 114, 162, 186]
Last 10 columns to process (least important): [62, 67, 208, 61, 78, 81, 232, 222, 230, 168]

Hessian diagonal for most important column: 1183.9625
Hessian diagonal for least important column: 876.6682

The Hessian diagonal values show that the most sensitive columns have noticeably larger values than the least sensitive ones. Processing these sensitive columns first limits how much error accumulates for the weights that matter most.

Out[22]:
Visualization
Sorted Hessian diagonal values representing weight sensitivity on a logarithmic scale. The steep decline indicates that a small fraction of weights possesses high sensitivity (large second derivative) while the majority are less critical, supporting the ActOrder strategy of prioritizing sensitive weights.

ActOrder typically improves perplexity by 0.1-0.5 points at 4-bit precision, with the benefit more pronounced for smaller models where each weight matters more.
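
As a rough illustration (reusing the toy gptq_quantize_row, Hessian, and sample data from the earlier cells, so the numbers are not meaningful benchmarks), applying ActOrder amounts to permuting the columns before quantization and undoing the permutation afterwards:

# Quantize row 0 again, but in ActOrder: permute weights and inverse Hessian so
# the most sensitive columns come first, then map the result back.
perm = get_actorder_permutation(H)
inv_perm = torch.argsort(perm)

H_inv_perm = H_inv[perm][:, perm]          # inverse Hessian in permuted order
row_perm = sample_weights[0][perm]

q_row, params = gptq_quantize_row(row_perm, H_inv_perm)

# Dequantize in permuted order, then restore the original column order
deq_perm = torch.zeros_like(row_perm)
for col in range(d_in):
    scale, zero_point = params[col // 128]
    deq_perm[col] = dequantize_weight(q_row[col].item(), scale, zero_point)
deq_actorder = deq_perm[inv_perm]

actorder_mse = ((sample_weights[0] @ sample_inputs.T
                 - deq_actorder @ sample_inputs.T) ** 2).mean()
print(f"ActOrder output MSE for row 0: {actorder_mse.item():.6f}")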

Perplexity Comparison

The ultimate test of quantization quality is model performance on downstream tasks. For language models, perplexity on held-out text provides a good proxy. Here's typical performance for GPTQ-quantized LLaMA models on WikiText-2:

Perplexity comparison between FP16 and GPTQ 4-bit on WikiText-2.

Model        FP16 PPL    GPTQ 4-bit PPL    Degradation
LLaMA-7B     5.68        5.85              +0.17
LLaMA-13B    5.09        5.20              +0.11
LLaMA-30B    4.77        4.84              +0.07
LLaMA-65B    4.53        4.58              +0.05

The perplexity degradation shrinks with model size. Larger models have more redundancy and can better absorb quantization error. At 65B parameters, the difference between FP16 and GPTQ 4-bit is barely measurable on most benchmarks.

Limitations and Impact

GPTQ revolutionized LLM deployment by making aggressive quantization practical. A 65B parameter model that requires 130GB in FP16 fits in under 35GB with GPTQ 4-bit quantization. This brought state-of-the-art language models from server clusters to consumer GPUs.

However, GPTQ has several limitations worth understanding:

Calibration sensitivity. The quality of quantization depends on the calibration data. Models quantized with calibration data from one domain may perform poorly on another. For specialized applications (code generation, medical text), using domain-specific calibration data is important.

Computational cost. GPTQ quantization is significantly slower than naive round-to-nearest. Quantizing a 7B model takes around 10-30 minutes on a modern GPU, and the time scales superlinearly with model size due to the Hessian computations. This is acceptable for one-time quantization but makes GPTQ unsuitable for dynamic quantization during inference.

Layer-wise approximation. By optimizing each layer independently, GPTQ ignores how quantization errors compound across layers. A small error in early layers might be amplified by later layers. Methods like GPTQ don't account for this cross-layer interaction.

Activation quantization. GPTQ only quantizes weights, not activations. For inference on specialized hardware that requires both weight and activation quantization, additional techniques are needed.

Outlier sensitivity. The Hessian estimation can be skewed by outlier activations. As discussed in our chapter on INT8 quantization, transformer models sometimes produce extreme activation values that disproportionately influence the statistics. The dampening factor helps but doesn't fully solve this.

Despite these limitations, GPTQ remains one of the most effective post-training quantization methods for LLMs. Its combination of theoretical foundation (optimal error compensation) and practical efficiency (Cholesky-based updates) set the standard for weight quantization.

The success of GPTQ inspired subsequent work like AWQ (Activation-aware Weight Quantization), which we'll explore in the next chapter. AWQ takes a different approach by preserving important weights rather than compensating for errors, often achieving even better results on certain models.

Summary

GPTQ transforms weight quantization from a simple rounding problem into an optimization problem. By compensating for each weight's quantization error through updates to remaining weights, GPTQ achieves reconstruction error far below what naive quantization can manage.

The key elements that make GPTQ work are the Hessian matrix (capturing weight importance through input statistics), the optimal compensation formula (distributing error across correlated weights), and algorithmic optimizations (Cholesky factorization, column-wise processing) that make the approach tractable at scale.

For you, GPTQ means that 4-bit quantization is practical for production LLM deployment. Models lose minimal accuracy while fitting in a fraction of the original memory. The calibration data requirements and quantization time are modest costs for the dramatic efficiency gains.

Understanding GPTQ also provides insight into the broader space of quantization methods. The principle of error compensation, rather than pure rounding, appears in various forms across modern quantization techniques. Whether you're deploying a quantized model or developing new efficiency methods, the foundations established by GPTQ remain essential knowledge.

