Learn how weight quantization maps floating-point values to integers, reducing LLM memory by 4x. Covers scale, zero-point, symmetric vs asymmetric schemes.

Weight Quantization Basics
Large language models are memory hogs. A 7 billion parameter model stored in 16-bit precision requires 14 GB just for the weights, and larger models scale proportionally: 70B parameters means 140 GB. As we discussed in the KV Cache chapters, memory bandwidth often bottlenecks LLM inference more than raw compute. Every forward pass must load billions of parameters from memory into the GPU's compute units, and this data movement dominates inference latency.
Quantization provides a direct solution: represent model weights using fewer bits. Instead of 16 or 32 bits per parameter, we can use 8, 4, or even fewer bits. This shrinks memory footprint, reduces bandwidth requirements, and often enables hardware-accelerated integer arithmetic that runs faster than floating-point operations. A 4-bit quantized 70B model fits in 35 GB, making it deployable on consumer hardware that couldn't touch the original.
But quantization isn't free. Cramming continuous values into discrete bins inevitably loses information. The goal of quantization is minimizing this loss while maximizing compression. This chapter covers the fundamental concepts: how quantization maps floats to integers, the different schemes for choosing that mapping, and how calibration finds optimal parameters for a given model.
Why Memory Bandwidth Limits Inference
Before diving into quantization mechanics, let's understand why memory dominates LLM inference costs. Modern GPUs have enormous compute throughput. An NVIDIA A100 delivers 312 teraflops of FP16 compute but only 2 TB/s of memory bandwidth. For each byte loaded from memory, the GPU can perform roughly 156 floating-point operations.
During autoregressive generation, each token requires loading essentially all model weights once. A single matrix multiplication of shape $d_{\text{model}} \times d_{\text{ff}}$ (where $d_{\text{model}}$ is the model dimension and $d_{\text{ff}}$ is the feed-forward dimension) loads $d_{\text{model}} \cdot d_{\text{ff}}$ parameters but performs only $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$ floating-point operations. With a batch size of 1, this gives an arithmetic intensity of just 2 FLOPs per loaded element. GPUs capable of 156 FLOPs per byte sit idle waiting for data.
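To make the imbalance concrete, here is a quick back-of-envelope calculation. The dimensions and hardware figures are illustrative assumptions, roughly matching an A100-class GPU:

```python
# Arithmetic intensity of a batch-1 matrix-vector product at FP16.
d_model, d_ff = 4096, 11008        # illustrative transformer dimensions
bytes_per_param = 2                # FP16 weights

params_loaded = d_model * d_ff     # weights read from memory once
flops = 2 * d_model * d_ff         # one multiply + one add per weight

flops_per_element = flops / params_loaded                    # = 2
flops_per_byte = flops / (params_loaded * bytes_per_param)   # = 1 at FP16

hardware_flops_per_byte = 312e12 / 2e12                      # ~156 for an A100-class GPU
print(f"workload: {flops_per_element:.0f} FLOPs/element, {flops_per_byte:.0f} FLOPs/byte")
print(f"hardware: ~{hardware_flops_per_byte:.0f} FLOPs/byte available -> memory-bound")
```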
Quantization directly attacks this bottleneck. Loading 4-bit weights instead of 16-bit weights moves 4 times less data across the memory bus. Even if dequantization adds some computational overhead, the memory savings dominate for small-batch inference. This explains why quantization has become essential for efficient LLM deployment.
The Quantization Mapping
The process of mapping values from a continuous (or high-precision) representation to a discrete (or lower-precision) representation. In neural network contexts, this typically means converting 32-bit or 16-bit floating-point numbers to 8-bit or smaller integers.
At its core, quantization addresses a fundamental tension between precision and efficiency. Neural networks learn weights as floating-point numbers, which can represent an enormous range of values with fine granularity. However, this precision comes at a cost: each 32-bit float consumes four bytes of memory, and operations on floating-point numbers require specialized hardware circuits. The key insight behind quantization is that neural networks are remarkably robust to small perturbations in their weights. We don't need exact values; we need values that are close enough to preserve the network's learned behavior.
Consider a weight tensor with values ranging from $-0.5$ to $0.3$. We want to represent these using 8-bit signed integers, which can hold values from $-128$ to $127$. The mapping must preserve the relative relationships between values while fitting them into our target range. Think of this as creating a ruler where each tick mark represents one integer value, and we need to position values from the original continuous number line onto the nearest tick mark.
The standard affine quantization formula maps a real value $r$ to a quantized integer $q$:

$$q = \text{round}\left(\frac{r}{s}\right) + Z$$

where:
- $q$: the quantized integer value
- $r$: the input real value (floating-point)
- $s$: the scale factor, determining the step size of the quantization grid
- $Z$: the zero-point, an integer value to which real zero is mapped
- $\text{round}(\cdot)$: the rounding operation (typically to the nearest integer)
This formula captures the essence of what quantization does. First, we divide the real value by the scale, which converts from real units to "quantization units." The scale determines how much of the real number line each integer step represents. If the scale is 0.01, then adjacent integers represent values that differ by 0.01 in the original space. After scaling, we round to the nearest integer because integers cannot represent fractional values. Finally, we add the zero-point, which shifts the entire mapping so that real zero lands on a particular integer value rather than necessarily on integer zero.
The zero-point serves an important purpose: it allows the quantized representation to efficiently use the available integer range even when the original values are not centered at zero. If all your weights happen to be positive, you would waste half the signed integer range without a zero-point adjustment.
To recover the original value (approximately), we apply the inverse dequantization operation:

$$\hat{r} = s \cdot (q - Z)$$

where:
- $\hat{r}$: the reconstructed real value (approximation of the original $r$)
- $s$: the scale factor
- $q$: the quantized integer
- $Z$: the zero-point
The dequantization formula simply reverses the quantization process. We first subtract the zero-point to undo the shift, then multiply by the scale to convert back from integer units to real units. Notice the hat notation on $\hat{r}$, which signals that this is an approximation rather than the exact original value. The reconstructed value won't exactly equal $r$ due to rounding, and this difference constitutes quantization error. Our goal is choosing $s$ and $Z$ to minimize this error across the tensor.
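As a minimal sketch (assuming NumPy; the function names are illustrative), the two formulas translate almost directly into code:

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: q = clip(round(r / scale) + zero_point, qmin, qmax)."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Inverse mapping: r_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

# With a scale of 0.01, adjacent integers represent values 0.01 apart.
r = np.array([-0.25, 0.0, 0.013, 0.42], dtype=np.float32)
q = quantize(r, scale=0.01, zero_point=0)
print(q)                        # [-25   0   1  42]
print(dequantize(q, 0.01, 0))   # [-0.25  0.    0.01  0.42]
```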
Determining Scale and Zero-Point
With the quantization formula established, we now face a practical question: how do we choose the scale and zero-point? These parameters must be selected carefully because they determine both the range of representable values and the precision within that range. The intuition is straightforward: we want to map the full range of actual tensor values onto the full range of available integers, using every bit of precision the integer format provides.
We derive the quantization parameters by establishing a linear mapping where the real minimum $r_{\min}$ maps to the integer minimum $q_{\min}$ and $r_{\max}$ maps to $q_{\max}$. This ensures that the smallest and largest values in our tensor precisely hit the boundaries of the integer range, leaving no representable integers unused.

$$r_{\min} = s \cdot (q_{\min} - Z), \qquad r_{\max} = s \cdot (q_{\max} - Z)$$

where:
- $r_{\min}, r_{\max}$: the minimum and maximum values in the real (floating-point) tensor
- $q_{\min}, q_{\max}$: the minimum and maximum values in the target integer range (e.g., -128 and 127)
- $s$: the scale factor to be determined
- $Z$: the zero-point to be determined
These two equations represent the boundary conditions of our linear mapping. The first equation says that when we dequantize the minimum integer value, we should recover the minimum real value. The second says the same for the maximum. Together, they give us two equations with two unknowns, which we can solve algebraically.
Subtracting the first equation from the second eliminates $Z$, which is the key insight that makes this system tractable:

$$r_{\max} - r_{\min} = s \cdot (q_{\max} - q_{\min})$$

Solving for $s$ yields the scale factor:

$$s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}$$

where:
- $s$: the scale factor representing the step size (real value per integer increment)
- $r_{\max} - r_{\min}$: the total range of the continuous values
- $q_{\max} - q_{\min}$: the total number of discrete steps available
This formula has a straightforward interpretation: the scale is simply the ratio of the real range to the integer range. If your real values span 1.0 units and you have 256 integers available (255 steps), each integer step represents $1.0 / 255 \approx 0.0039$ real units. The scale tells you exactly how much precision you have: smaller scales mean finer distinctions between values, while larger scales mean coarser quantization.
Substituting $s$ back into the first equation allows us to solve for the zero-point $Z$:

$$Z = q_{\min} - \frac{r_{\min}}{s}$$

where:
- $Z$: the zero-point (in real arithmetic before rounding)
- $r_{\min} / s$: the minimum real value scaled to the integer domain
The zero-point equation determines where real zero falls in the integer range. If the real values are centered around zero, the zero-point will be near the middle of the integer range. If the real values are all positive, the zero-point will be at or near the minimum integer value.
Since $Z$ must be an integer, we round the result:

$$Z = \text{round}\left(q_{\min} - \frac{r_{\min}}{s}\right)$$

where:
- $s$: the computed scale factor
- $Z$: the computed zero-point
- $r_{\min}$: the minimum value in the floating-point tensor
- $q_{\min}$: the minimum value of the target integer range
The rounding of the zero-point introduces a small complication: the boundary conditions we started with may no longer be satisfied exactly, since $r_{\min}$ and $r_{\max}$ need not map precisely onto $q_{\min}$ and $q_{\max}$ after rounding. In practice, this slight imprecision is negligible compared to the rounding errors inherent in quantization itself.
For 8-bit unsigned integers, $q_{\min} = 0$ and $q_{\max} = 255$. For 8-bit signed integers, $q_{\min} = -128$ and $q_{\max} = 127$. The scale determines the resolution: smaller scales mean finer granularity but narrower representable range.
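A small helper that derives both parameters from an observed range might look like this. It is a sketch with an illustrative function name, and it assumes the common convention of extending the range so real zero is always representable:

```python
def compute_scale_zero_point(r_min, r_max, num_bits=8, signed=False):
    """Derive (scale, zero_point) so that [r_min, r_max] maps onto the integer range."""
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include zero so real 0.0 is always representable
    # (a common convention, assumed here).
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))
    return scale, zero_point, (qmin, qmax)

scale, zp, (qmin, qmax) = compute_scale_zero_point(-0.5, 0.3)
print(f"scale={scale:.5f}, zero_point={zp}, range=[{qmin}, {qmax}]")
# scale=0.00314, zero_point=159, range=[0, 255]
```

The $[-0.5, 0.3]$ range here is the same one used in the symmetric and asymmetric examples later in this chapter.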
Quantization Error
Every quantization operation introduces error because we are fundamentally discarding information. When we round a continuous value to the nearest integer, any fractional part is lost. Understanding the nature and magnitude of this error helps us make informed decisions about quantization parameters and bit widths.
The rounding operation introduces error that depends on where the original value falls relative to the quantization grid. For uniformly distributed values, the maximum error from rounding is $s/2$. This occurs when a value falls exactly halfway between two adjacent quantization levels. On average, the rounding error follows a uniform distribution between $-s/2$ and $s/2$, with mean zero and variance $s^2/12$. This means smaller scales produce smaller errors but can only represent smaller ranges. If the true value lies outside the representable range, we must clip it, potentially causing larger errors for outliers.
The mean squared quantization error for a tensor depends on both the value distribution and the choice of $s$ and $Z$. Values near the center of the distribution incur only rounding error, while outliers may be severely clipped. This observation motivates many of the advanced quantization techniques we will explore in later chapters, which focus on handling outliers gracefully rather than allowing them to corrupt the quantization of typical values.
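A quick empirical check, a sketch assuming uniformly distributed values and symmetric quantization with no clipping, confirms these statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(-1.0, 1.0, size=100_000).astype(np.float32)

scale = 2.0 / 255                 # map [-1, 1] onto 255 integer steps
q = np.clip(np.round(r / scale), -128, 127)
err = scale * q - r               # reconstruction error (zero-point is 0 here)

print(f"max |error|    : {np.abs(err).max():.6f}  (bound s/2 = {scale / 2:.6f})")
print(f"error variance : {err.var():.2e}  (s^2/12 = {scale**2 / 12:.2e})")
print(f"error mean     : {err.mean():.2e}")
```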
Symmetric vs Asymmetric Quantization
The two main quantization schemes differ in how they handle the zero-point. This choice significantly affects both accuracy and computational efficiency. Understanding when to use each scheme requires considering both the mathematical properties of the data and the practical constraints of the target hardware.
Symmetric Quantization
Symmetric quantization takes a simpler approach by forcing the zero-point to be zero ($Z = 0$), which centers the quantized range around zero. This constraint means that real zero always maps to integer zero, creating a symmetric mapping where positive and negative values are treated identically. The scale is determined by the largest absolute value in the tensor:

$$s = \frac{\max_i |r_i|}{q_{\max}}$$

where:
- $s$: the scale factor
- $\max_i |r_i|$: the maximum absolute value present in the tensor
- $q_{\max}$: the maximum positive integer in the target range
This formula ensures that the largest magnitude value, whether positive or negative, maps to the edge of the integer range. The symmetry comes from the fact that $-\max_i |r_i|$ would map to $-q_{\max}$, using the same scale factor.
For 8-bit signed integers with $q_{\max} = 127$, a tensor with values in $[-0.5, 0.3]$ would use $s = 0.5 / 127 \approx 0.00394$. This means:
- $-0.5$ maps to $\text{round}(-0.5 / 0.00394) = -127$
- $0$ maps to $0$
- $0.3$ maps to $\text{round}(0.3 / 0.00394) = 76$
The main advantage of symmetric quantization is computational simplicity. With $Z = 0$, the dequantization becomes a simple multiplication:

$$\hat{r} = s \cdot q$$

where:
- $\hat{r}$: the reconstructed real value
- $s$: the scale factor
- $q$: the quantized integer
This simplification significantly improves computational efficiency. Matrix multiplications can stay in integer arithmetic longer before converting back to floating point. When multiplying two symmetrically quantized matrices, we can perform the entire integer multiplication first, then apply the combined scale factor to the result. Many hardware accelerators, including tensor cores, are optimized for this pattern.
However, symmetric quantization wastes representable range when the tensor distribution is asymmetric. In our example, positive values only reach 0.3, which maps to integer 76, leaving integers 77-127 unused. This effectively reduces the precision available for actual values. Those 51 unused integer levels represent wasted precision that could have been used to distinguish between values more finely.
Asymmetric Quantization
Asymmetric quantization removes the symmetry constraint, allowing a non-zero $Z$ and thereby using the full integer range for the actual value distribution. This flexibility comes at the cost of additional complexity but provides better precision when the data distribution is skewed.
For our tensor with values in $[-0.5, 0.3]$, mapped onto the unsigned 8-bit range $[0, 255]$:

$$s = \frac{0.3 - (-0.5)}{255 - 0} \approx 0.00314, \qquad Z = \text{round}\left(0 - \frac{-0.5}{0.00314}\right) = 159$$

Now:
- $-0.5$ maps to $\text{round}(-0.5 / 0.00314) + 159 = 0$
- $0$ maps to $159$ (the zero-point)
- $0.3$ maps to $\text{round}(0.3 / 0.00314) + 159 = 255$
The full integer range is utilized, providing finer resolution. The scale is 0.00314 instead of 0.00394, meaning each integer step represents a smaller real interval and quantization error is reduced. This improvement of approximately 20% in scale translates directly to a 20% reduction in maximum quantization error per value.
The tradeoff is computational complexity. The zero-point must be tracked and subtracted during dequantization or matrix operations. This adds overhead and can complicate hardware-optimized kernels. Additionally, the zero-point requires storage alongside the scale. For matrix multiplication between two asymmetrically quantized matrices, the computation involves cross-terms between values and zero-points that don't appear in the symmetric case.
Choosing Between Schemes
The choice depends on the tensor distribution and hardware constraints:
- Weights often follow approximately symmetric distributions centered near zero, making symmetric quantization effective with minimal range waste
- Activations frequently have asymmetric distributions, especially after ReLU (all positive) or after certain normalization layers. Asymmetric quantization better captures these distributions
- Hardware support varies. Some accelerators handle symmetric quantization much more efficiently, making it preferable even when asymmetric would be theoretically better
In practice, weight quantization commonly uses symmetric schemes while activation quantization uses asymmetric schemes, though this varies by implementation. The decision ultimately balances mathematical optimality against engineering constraints, and different deployment scenarios may favor different choices.
Per-Tensor vs Per-Channel Quantization
Beyond the symmetric/asymmetric choice, we must decide the granularity at which to compute quantization parameters. Should a single scale cover an entire weight matrix, or should different parts of the matrix have different scales? This decision profoundly affects both quantization accuracy and storage overhead.
Per-Tensor Quantization
The simplest approach computes one scale (and possibly one zero-point) for an entire tensor. All elements share the same quantization mapping, which means every value in the tensor is quantized using identical parameters.
This is memory-efficient: a weight matrix with millions of elements needs only 1-2 extra parameters for quantization. The overhead is essentially zero. However, if different regions of the tensor have very different value ranges, per-tensor quantization performs poorly. Outliers force a large scale, reducing precision for the majority of values.
Consider a weight matrix where most values fall in $[-0.1, 0.1]$ but a few outliers reach $1.0$. Per-tensor symmetric 8-bit quantization uses $s = 1.0 / 127 \approx 0.00787$. Values of 0.05 quantize to $\text{round}(0.05 / 0.00787) = 6$ and dequantize to approximately 0.047. The relative error is 6%. Meanwhile, the outlier at 1.0 maps perfectly to 127. The common-case values suffer large relative errors to accommodate rare outliers.
This phenomenon illustrates a fundamental tradeoff in quantization: we must balance precision across the entire value distribution. When a single scale must serve all values, extreme values dictate the scale, and typical values pay the precision penalty.
Per-Channel Quantization
Per-channel quantization addresses this limitation by computing separate scales for each output channel of a weight tensor. For a linear layer with weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ (where $d_{\text{out}}$ is the output dimension and $d_{\text{in}}$ is the input dimension), we compute $d_{\text{out}}$ different scales, one per row.
This allows each output neuron's weights to use the full integer range based on that neuron's specific value distribution. If neuron 0 has weights in $[-0.1, 0.1]$ and neuron 1 has weights in $[-1.0, 1.0]$, they get scales of approximately 0.00079 and 0.00787 respectively. Each achieves good precision for its own distribution, without being penalized by the other's characteristics.
The cost is storing $d_{\text{out}}$ scale values instead of 1. For typical transformer dimensions, this is negligible: a 4096-dimensional layer adds 4096 floats (16 KB) to store scales, while the weight matrix itself contains $4096 \times 4096 \approx 16.8$ million elements (8 MB at 4-bit). The overhead is well under 1%. This small cost buys significant accuracy improvements.
Per-Group Quantization
An intermediate approach divides each channel into groups of fixed size (commonly 32 or 128 elements) and computes scales per group. This handles within-channel variation while keeping overhead manageable. The group size represents a tunable parameter that balances precision against storage.
For a $4096 \times 4096$ matrix with group size 128, we need $4096 \times (4096 / 128) = 131{,}072$ scale values. At FP16, this adds 262 KB of overhead (roughly 3% of the 4-bit weight size). The finer granularity better handles outliers without significantly inflating storage.
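Per-group quantization is only a small extension of per-channel quantization: reshape each row into groups and compute one scale per group. The sketch below (NumPy, symmetric 4-bit, illustrative helper name) reproduces the scale count from the example above:

```python
import numpy as np

def quantize_per_group(w, group_size=128, num_bits=4):
    """Symmetric per-group quantization: one scale per group of `group_size`
    consecutive weights within each row. Returns (q, scales)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for 4-bit signed
    d_out, d_in = w.shape
    assert d_in % group_size == 0
    groups = w.reshape(d_out, d_in // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-12)             # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return q.reshape(d_out, d_in), scales.squeeze(-1)

w = np.random.default_rng(0).normal(0, 0.02, size=(4096, 4096)).astype(np.float32)
q, scales = quantize_per_group(w, group_size=128, num_bits=4)
print(scales.shape)   # (4096, 32) -> 131,072 scales, matching the estimate above
```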
Per-group quantization has become standard in aggressive 4-bit quantization schemes like GPTQ and AWQ, which we'll cover in upcoming chapters. The additional precision from per-group scales often makes the difference between acceptable and unacceptable accuracy loss. When pushing to very low bit widths, every bit of precision matters, and per-group quantization provides a reliable way to recover precision lost to value distribution variations.
Calibration
The process of determining optimal quantization parameters (scale, zero-point, or clipping thresholds) by analyzing the actual value distributions in a model, typically using representative input data.
We've assumed we know the minimum and maximum values of each tensor, but finding these optimally requires care. Simply taking the observed min/max may not be ideal, especially for activations that vary with input data. Calibration is the process of determining these parameters systematically, and the quality of calibration directly impacts quantized model accuracy.
Weight Calibration
Weights are fixed after training, so their distributions are known exactly. For per-tensor quantization, we can directly compute global min/max. For per-channel or per-group, we compute statistics for each partition. Since weights don't change during inference, calibration happens once and the parameters are stored with the model.
However, raw min/max may be skewed by rare extreme values. Some methods clip the range to exclude extreme outliers, accepting large errors on those values in exchange for better precision on typical values. The reasoning is that if only 0.1% of values are extreme, accepting large errors on those values while improving precision for the other 99.9% reduces overall error. Common approaches include:
- Percentile clipping: Use the 99.9th percentile instead of the true maximum, sacrificing outlier accuracy
- MSE minimization: Search for the clipping threshold that minimizes mean squared quantization error over the tensor (see the sketch after this list)
- Entropy-based: Choose thresholds that minimize KL divergence between original and quantized distributions
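The MSE-minimization strategy, for instance, can be implemented as a simple search over candidate clipping thresholds. This is a sketch under the assumption of symmetric quantization; the helper name and the candidate grid are illustrative:

```python
import numpy as np

def mse_clip_threshold(w, num_bits=8, num_candidates=100):
    """Return the clipping threshold that minimizes mean squared quantization error."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.abs(w).max())
    best_t, best_mse = max_abs, np.inf
    # Search thresholds between 10% and 100% of the observed maximum.
    for frac in np.linspace(0.1, 1.0, num_candidates):
        t = frac * max_abs
        scale = t / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)   # values beyond t get clipped
        mse = float(np.mean((scale * q - w) ** 2))
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000).astype(np.float32)
w[:10] *= 5                        # inject a few extreme outliers
print(f"max|w| = {np.abs(w).max():.3f}, MSE-optimal clip = {mse_clip_threshold(w):.3f}")
```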
Activation Calibration
Activations present a harder problem because their distributions depend on inputs. A model might produce activations spanning a narrow range on one input and a much wider, shifted range on another. We need to find quantization parameters that work well across the expected input distribution, not just for any single input.
The standard approach runs the model on a calibration dataset: a small representative sample of inputs. We collect activation statistics (min, max, percentiles, or full histograms) across this dataset, then compute quantization parameters from the aggregated statistics. The calibration dataset serves as a proxy for the true inference distribution.
Common aggregation strategies include:
- Running min/max: Track the most extreme values seen across all calibration samples
- Moving average: Smooth the statistics across samples to reduce sensitivity to outliers
- Histogram-based: Build full histograms and choose optimal thresholds by minimizing reconstruction error
The calibration dataset should resemble actual inference inputs. Using the wrong distribution during calibration can lead to clipping or range waste during deployment. A few hundred to a few thousand samples typically suffice for stable statistics, though the exact number depends on how variable the activations are.
Dynamic vs Static Quantization
Calibration determines whether quantization is static or dynamic:
- Static quantization fixes all quantization parameters at calibration time. Inference uses predetermined scales and zero-points, with no runtime overhead for parameter computation. This is efficient but may lose accuracy if inference distributions differ from calibration.
- Dynamic quantization computes activation quantization parameters on-the-fly for each input. This adapts to the actual values but adds overhead for computing statistics during inference. It's commonly used when activation distributions are highly variable.
Weight quantization is almost always static since weights don't change. Activation quantization may be either, depending on the accuracy-efficiency tradeoff desired.
Implementation
Let's implement the core quantization operations to solidify these concepts. We'll start with basic symmetric quantization, then extend to asymmetric and per-channel variants. Working through concrete code helps build intuition for how the mathematical formulas translate to practice.
The weights follow a roughly symmetric distribution centered near zero, typical of neural network parameters after initialization or training. This distribution is well-suited to symmetric quantization, which we will implement first.
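Since the focus here is on the quantization operations themselves, the examples below operate on a synthetic stand-in for a trained weight matrix: a zero-mean Gaussian with lightly clipped tails. This is an assumption for illustration, not weights from a real model:

```python
import numpy as np

rng = np.random.default_rng(42)
# Zero-mean Gaussian weights with tails clipped at 4 standard deviations,
# standing in for a trained weight matrix.
weights = np.clip(rng.normal(0.0, 0.02, size=(4096, 4096)), -0.08, 0.08).astype(np.float32)

print(f"min={weights.min():.4f}, max={weights.max():.4f}, std={weights.std():.4f}")
```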
Symmetric Per-Tensor Quantization
The symmetric per-tensor implementation demonstrates the simplest form of quantization. We compute a single scale from the maximum absolute value, then apply the same quantization formula to every element in the tensor.
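One possible implementation, continuing with the `weights` tensor defined above (function names are illustrative):

```python
def quantize_symmetric(w, num_bits=8):
    """Symmetric per-tensor quantization: the zero-point is fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8-bit signed
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q, scale):
    return scale * q.astype(np.float32)

q, scale = quantize_symmetric(weights, num_bits=8)
w_hat = dequantize_symmetric(q, scale)

abs_err = np.abs(w_hat - weights)
print(f"scale          : {scale:.6f}")
print(f"mean abs error : {abs_err.mean():.6f}")
print(f"weight std     : {weights.std():.6f}")
print(f"relative error : {abs_err.mean() / weights.std():.2%}")
```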
The mean absolute error is small compared to the weight standard deviation. 8-bit symmetric quantization preserves weights well when the distribution is roughly symmetric. The relative error of less than 1% indicates that the quantized weights closely approximate the originals, which bodes well for maintaining model accuracy.
Asymmetric Per-Tensor Quantization
The asymmetric implementation adds complexity by computing and tracking a zero-point. This extra parameter allows better use of the integer range when values are not centered at zero.
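A sketch of the asymmetric variant, mapping onto the unsigned 8-bit range and reusing `weights` from above (helper names again illustrative):

```python
def quantize_asymmetric(w, num_bits=8):
    """Asymmetric per-tensor quantization onto an unsigned integer range."""
    qmin, qmax = 0, 2 ** num_bits - 1              # [0, 255] for 8-bit
    r_min, r_max = float(w.min()), float(w.max())
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

q, scale, zp = quantize_asymmetric(weights, num_bits=8)
w_hat = dequantize_asymmetric(q, scale, zp)
print(f"scale={scale:.6f}, zero_point={zp}")
print(f"mean abs error: {np.abs(w_hat - weights).mean():.6f}")
```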
For this symmetric weight distribution, asymmetric quantization provides similar accuracy to symmetric. The zero-point is near 128, roughly the middle of the unsigned range, as expected for zero-centered data. When the data is already symmetric, asymmetric quantization offers little benefit but also causes no harm.
Per-Channel Quantization
Per-channel quantization computes separate scales for each row of the weight matrix, allowing each output channel to use its full precision budget independently.
To illustrate the benefits of per-channel quantization, we create a tensor where different channels have dramatically different value ranges. This simulates the real-world situation where some neurons develop larger weights than others during training.
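One way to set this up is sketched below. The two-channel toy tensor and the helper names are illustrative, and `quantize_symmetric` is reused from the earlier block:

```python
def quantize_per_channel(w, num_bits=8, axis=0):
    """Symmetric quantization with one scale per slice along `axis`
    (axis=0 gives one scale per output channel, i.e. per row)."""
    qmax = 2 ** (num_bits - 1) - 1
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    scales = np.abs(w).max(axis=reduce_axes, keepdims=True) / qmax
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

# Two channels with dramatically different ranges: channel 0 is ~500x larger.
rng = np.random.default_rng(0)
w_mixed = np.stack([rng.normal(0, 0.5, 1024),
                    rng.normal(0, 0.001, 1024)]).astype(np.float32)

# Per-tensor: a single scale, dictated by the large channel 0.
q_t, s_t = quantize_symmetric(w_mixed, num_bits=8)
err_t = (dequantize_symmetric(q_t, s_t) - w_mixed) ** 2

# Per-channel: each row gets its own scale.
q_c, s_c = quantize_per_channel(w_mixed, num_bits=8, axis=0)
err_c = (s_c * q_c - w_mixed) ** 2

print(f"channel 1 MSE, per-tensor : {err_t[1].mean():.2e}")
print(f"channel 1 MSE, per-channel: {err_c[1].mean():.2e}")
```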
Per-channel quantization dramatically reduces error for channel 1. With per-tensor quantization, the large channel forces a big scale, causing the small channel to quantize coarsely. Per-channel quantization gives each channel an appropriate scale. The error reduction for channel 1 can be several orders of magnitude, demonstrating why per-channel quantization is essential for high-accuracy quantization.
Visualizing Quantization Error
Visualization helps build intuition about how quantization errors distribute across values. The following plots show both the error distribution and the relationship between original and reconstructed values.
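With matplotlib available, a sketch of these plots for the 8-bit symmetric quantization above might look like this:

```python
import matplotlib.pyplot as plt

q, scale = quantize_symmetric(weights, num_bits=8)
w_hat = dequantize_symmetric(q, scale)
errors = (w_hat - weights).ravel()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: error distribution, bounded by +/- scale/2 (red dashed lines).
ax1.hist(errors, bins=100)
ax1.axvline(-scale / 2, color="red", linestyle="--")
ax1.axvline(scale / 2, color="red", linestyle="--")
ax1.set_title("Quantization error distribution")
ax1.set_xlabel("error")

# Right: original vs reconstructed weights (random subsample for readability).
idx = np.random.default_rng(0).choice(weights.size, 2000, replace=False)
ax2.scatter(weights.ravel()[idx], w_hat.ravel()[idx], s=2)
ax2.set_title("Original vs reconstructed")
ax2.set_xlabel("original weight")
ax2.set_ylabel("reconstructed weight")

plt.tight_layout()
plt.show()
```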
The error histogram shows that quantization errors are bounded and mostly small. The maximum possible error is half the scale (when a value falls exactly between two quantization levels). The scatter plot confirms tight correlation between original and reconstructed values, with points clustering closely around the diagonal line that represents perfect reconstruction.
Simulating Calibration
Calibration in practice involves running the model on representative data to collect statistics. The following code simulates this process for a simple model, demonstrating both minmax and percentile-based calibration approaches.
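A sketch of that process, using a toy stand-in model (a random linear layer plus ReLU) rather than a real LLM layer; the calibration helper and its `method` options are illustrative:

```python
def calibrate_activation_range(model_fn, calibration_inputs, method="minmax", percentile=99.9):
    """Collect activations over calibration data and derive a quantization range."""
    acts = np.concatenate([model_fn(x).ravel() for x in calibration_inputs])
    if method == "minmax":
        r_min, r_max = float(acts.min()), float(acts.max())
    elif method == "percentile":
        r_min = float(np.percentile(acts, 100 - percentile))
        r_max = float(np.percentile(acts, percentile))
    else:
        raise ValueError(f"unknown method: {method}")
    scale = (r_max - r_min) / 255          # asymmetric unsigned 8-bit
    return r_min, r_max, scale

# Toy "model": a random linear layer followed by ReLU, so activations are
# all positive with a long right tail.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)

def model_fn(x):
    return np.maximum(x @ W, 0.0)

calibration_inputs = [rng.normal(0, 1.0, size=(32, 256)).astype(np.float32) for _ in range(16)]

for method in ("minmax", "percentile"):
    r_min, r_max, scale = calibrate_activation_range(model_fn, calibration_inputs, method=method)
    print(f"{method:10s}: range=[{r_min:.3f}, {r_max:.3f}], scale={scale:.5f}")
```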
Percentile-based calibration produces a smaller scale, providing finer quantization resolution for the bulk of activations at the cost of clipping rare outliers. The scale reduction directly translates to smaller quantization errors for typical values, which usually improves overall model accuracy.
Comparing Bit Widths
The following experiment measures how quantization error changes with bit width, demonstrating the exponential relationship between precision and error.
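Using the symmetric helper and the same synthetic `weights` tensor from above, the sweep might look like this:

```python
print("bits | MSE")
for num_bits in (2, 3, 4, 6, 8):
    q, scale = quantize_symmetric(weights, num_bits=num_bits)
    w_hat = dequantize_symmetric(q, scale)
    mse = np.mean((w_hat - weights) ** 2)
    print(f"{num_bits:4d} | {mse:.3e}")   # expect roughly a 4x drop in MSE per extra bit
```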
Error drops roughly by a factor of 4 for each additional bit. This follows from quantization theory: each extra bit doubles the number of representable levels, halving the quantization step size, which reduces mean squared error by a factor of 4. This relationship helps you make informed decisions about the precision-accuracy tradeoff for your specific applications.
Key Parameters
The key parameters for the quantization functions are:
- num_bits: The bit width for the target integer representation (e.g., 8 for INT8, 4 for INT4). Lower bits increase compression but reduce precision.
- axis: The dimension along which quantization parameters (scale, zero-point) are computed. For per-channel quantization of Linear layers, this is typically the output dimension (axis 0).
- method: The strategy for determining quantization ranges during calibration (e.g., 'minmax' or 'percentile').
Limitations and Practical Considerations
Quantization's appeal lies in its simplicity, but several challenges arise in practice.
Outlier sensitivity remains the primary obstacle to aggressive quantization. Transformer models frequently develop weight and activation outliers during training. A single extreme value in a tensor can force a large scale, wasting precision for all other values. Per-channel and per-group quantization mitigate this for weights, but handling activation outliers often requires more sophisticated techniques. Methods like LLM.int8() detect outliers at runtime and process them in higher precision, while AWQ and GPTQ (covered in upcoming chapters) use careful calibration to minimize outlier impact.
Accuracy degradation varies dramatically across models and tasks. Simple tasks like classification often tolerate aggressive quantization with minimal accuracy loss. Generation tasks, especially those requiring precise reasoning or factual recall, prove more sensitive. A model that performs well on perplexity benchmarks after 4-bit quantization may show subtle degradation in reasoning quality or factual consistency. Always evaluate on task-specific metrics, not just generic benchmarks.
Calibration dataset selection significantly impacts results. Using the wrong distribution during calibration can cause systematic clipping or range waste during deployment. For general-purpose models, calibration data should span the expected input distribution. For specialized applications, calibration on domain-specific data often yields better results than generic calibration.
Hardware support determines practical speedup. Integer operations are only faster if the hardware has efficient integer compute units. Modern GPUs offer accelerated INT8 and sometimes INT4 operations through tensor cores, but not all hardware supports all quantization formats. Some quantization schemes that theoretically reduce compute find no actual speedup due to missing hardware support or kernel availability.
Despite these challenges, quantization has enabled an explosion in LLM accessibility. Models that required enterprise-grade hardware now run on laptops and smartphones. The memory savings cascade into reduced serving costs, making AI applications economically viable at scales previously impossible. The next chapters will explore specific quantization methods, INT8 and INT4 techniques, and formats like GPTQ and AWQ that push compression further while maintaining accuracy.
Summary
Quantization maps high-precision floating-point values to lower-precision integers, reducing memory footprint and often enabling faster computation. The core operation involves a scale (determining step size) and optionally a zero-point (shifting the mapping), with the choice between symmetric and asymmetric schemes trading simplicity against representation efficiency.
Quantization granularity ranges from per-tensor (one scale for all values) through per-channel (one scale per output dimension) to per-group (multiple scales within each channel). Finer granularity better handles value distribution variations at the cost of additional scale storage.
Calibration determines optimal quantization parameters by analyzing actual tensor values. For weights, this is straightforward since values are fixed. For activations, running the model on representative calibration data provides necessary statistics. Methods like percentile clipping can improve results by trading accuracy on rare outliers for better precision on common values.
The fundamental tradeoff in quantization is information loss versus compression. Fewer bits mean smaller models and faster inference but increased quantization error. Understanding this tradeoff, and the tools for managing it, prepares you for the specific quantization techniques in the following chapters.












