INT8 Quantization: Absmax, Smooth Quantization & Implementation

Michael Brenndoerfer · January 11, 2026 · 43 min read

Master INT8 weight quantization with absmax and smooth quantization techniques. Learn to solve the outlier problem in large language models.


INT8 Quantization

In the previous chapter on weight quantization basics, we explored why representing neural network parameters with fewer bits can dramatically reduce memory requirements and accelerate inference. Now we dive into the most widely adopted quantization format: 8-bit integers (INT8). This precision point represents a sweet spot where memory savings are substantial (2x compression from FP16, 4x from FP32), specialized hardware acceleration is widely available, and accuracy degradation remains manageable for most applications.

INT8 quantization isn't just about storing numbers more compactly. Modern GPUs and specialized AI accelerators include dedicated INT8 matrix multiplication units that can achieve 2-4x higher throughput than their floating-point counterparts. NVIDIA's Tensor Cores, for example, perform INT8 operations at twice the rate of FP16 operations. This means that if we can quantize both weights and activations to INT8, we unlock not just memory savings but genuine compute speedup.

However, squeezing the continuous range of floating-point values into just 256 discrete integers introduces quantization error. Effective INT8 quantization minimizes this error while maximizing the benefits. We'll explore the mathematics of range mapping, the practical absmax quantization scheme, and the smooth quantization technique that makes INT8 viable even for large language models with problematic activation distributions.

The INT8 Representation Space

An 8-bit signed integer can represent values from -128 to 127, giving us 256 distinct values to work with. When we quantize a floating-point tensor, we're essentially creating a codebook that maps each of these 256 integers to a specific floating-point value. This mapping defines a correspondence between the rich, continuous space of floating-point numbers and a sparse, discrete set of representable points.

To understand this mapping intuitively, imagine you have a number line representing all possible floating-point values a weight might take. Quantization places 256 evenly-spaced "buckets" along this line, and every floating-point value must be assigned to its nearest bucket. The original value is then represented by that bucket's integer label, and reconstruction involves converting the label back to the bucket's center position. The spacing between these buckets, and where they are positioned along the number line, completely determines the quality of our approximation.

Quantization Grid

The set of representable values after quantization forms a uniform grid in the original floating-point space. The spacing between adjacent grid points is called the quantization step size or scale factor.

The fundamental challenge is choosing how to position this grid. Consider a weight tensor with values ranging from -0.5 to 1.2. We need to decide:

  • What floating-point value should -128 represent?
  • What floating-point value should 127 represent?
  • How do we handle values outside our chosen range (clipping)?

These choices define our quantization scheme and directly impact the reconstruction error. A poorly chosen grid might waste many of its 256 integers representing values that never actually occur in the tensor, while cramming the values that do occur into too few buckets. The goal is to align our quantization grid as closely as possible with the actual distribution of values we need to represent.

Out[2]:
Visualization
The quantization grid maps continuous floating-point values to discrete integer levels. Original values (blue circles) snap to the nearest grid point (orange squares), illustrating how quantization introduces error bounded by half the step size.

Symmetric Quantization with Absmax

The simplest and most common approach is symmetric quantization, where we center the quantization grid at zero. The "absmax" method finds the maximum absolute value in the tensor and maps it to the largest representable integer. The name comes from the key statistic it computes: the absolute maximum, which determines the extent of the quantization range in both the positive and negative directions simultaneously.

The intuition behind absmax quantization is straightforward. We want to ensure that no value in our tensor falls outside the representable range, which would force us to clip it and introduce potentially large errors. By finding the largest magnitude value in the tensor and designing our grid to just barely accommodate it, we guarantee that every original value can be represented without clipping. At the same time, we make the grid as fine as possible given this constraint, since using a larger range than necessary would waste precision.

Given a tensor $\mathbf{x}$ with floating-point values, absmax quantization works as follows:

$$
\begin{aligned}
\alpha &= \max(|\mathbf{x}|) \\
s &= \frac{\alpha}{127} \\
\mathbf{x}_q &= \text{round}\left(\frac{\mathbf{x}}{s}\right)
\end{aligned}
$$

where:

  • $\mathbf{x}$: the input floating-point tensor
  • $\alpha$: the maximum absolute value in the input tensor (the "absmax")
  • $s$: the scale factor that determines the spacing of the quantization grid
  • $127$: the maximum representable value in a signed 8-bit integer
  • $\mathbf{x}_q$: the resulting tensor of quantized 8-bit integers

Let's trace through this process step by step. First, we scan the entire tensor to find $\alpha$, the largest magnitude value regardless of sign. This becomes our reference point for setting the scale. Next, we compute the scale factor $s$ by dividing $\alpha$ by 127, which tells us how much "floating-point distance" each integer step represents. Finally, we quantize each value by dividing it by the scale factor and rounding to the nearest integer. This rounding step is where the actual information loss occurs, as we snap each continuous value to its nearest discrete representation.

To recover an approximation of the original values, we simply multiply by the scale:

$$
\hat{\mathbf{x}} = s \cdot \mathbf{x}_q
$$

where:

  • $\hat{\mathbf{x}}$: the reconstructed floating-point tensor (an approximation of the original $\mathbf{x}$)
  • $s$: the scale factor used during quantization
  • $\mathbf{x}_q$: the quantized integer tensor

The hat notation indicates this is a reconstruction, not the exact original values. The difference $\mathbf{x} - \hat{\mathbf{x}}$ is the quantization error. This error arises because the rounding operation discards fractional information. When we divided by $s$ and rounded, we lost the remainder, and multiplying the rounded result back by $s$ cannot restore what was lost. The beauty of this scheme is that the error is bounded: no reconstructed value can differ from its original by more than half a scale step, or $s/2$.
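
To make the round trip concrete, here is a small worked example with illustrative numbers (not taken from the tensors used later in this chapter). Suppose a single value $x = 0.30$ lives in a tensor whose absmax is $\alpha = 1.0$:

$$
\begin{aligned}
s &= \frac{1.0}{127} \approx 0.00787 \\
x_q &= \text{round}\!\left(\frac{0.30}{0.00787}\right) = \text{round}(38.1) = 38 \\
\hat{x} &= 38 \times 0.00787 \approx 0.299 \\
|x - \hat{x}| &\approx 0.001 < \frac{s}{2} \approx 0.0039
\end{aligned}
$$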

Why Symmetric Around Zero?

Centering the grid at zero has a crucial computational advantage: the integer zero maps exactly to floating-point zero. This property, which might seem like a minor mathematical nicety, turns out to have significant practical implications for neural network inference.

This matters because:

  • Many neural network activations are zero (from ReLU, dropout, padding)
  • Zero-valued weights don't contribute to computations
  • Preserving exact zeros avoids accumulating unnecessary rounding errors

Consider what happens during a matrix multiplication when many elements are zero. If zero maps exactly to zero, these elements contribute nothing to the output, exactly as they should. But if zero were represented by some non-zero integer due to an offset, every "zero" element would actually contribute a small spurious value to the computation. Over millions of operations, these spurious contributions could accumulate into meaningful errors.

Symmetric quantization also simplifies the math during matrix multiplication. When computing $\mathbf{Y} = \mathbf{X}\mathbf{W}$ with quantized values, we only need to track scale factors, not additional offset terms. This simplification translates directly into faster inference, since the dequantization step requires only a single multiplication rather than a multiplication followed by an addition.

The Clipping Trade-off

Absmax quantization clips all values to the range $[-\alpha, \alpha]$. If $\alpha$ is dominated by a few outlier values, the quantization grid becomes coarse for the majority of values. This creates a fundamental tension in quantization design: we want to accommodate outliers to avoid clipping errors, but we also want a fine grid to minimize rounding errors for typical values.

Consider a tensor where 99% of values lie between -0.1 and 0.1, but one outlier is 5.0. Using absmax:

$$
s = \frac{5.0}{127} \approx 0.0394
$$

This scale means values around 0.1 get quantized to just $\text{round}(0.1/0.0394) = 3$. The reconstruction is $3 \times 0.0394 = 0.118$, introducing significant relative error for these common values. To understand the magnitude of this problem, note that a typical value of 0.1 experiences an 18% relative error, while the outlier value of 5.0 would be reconstructed nearly perfectly. We have sacrificed precision where it matters most, on the bulk of our values, to perfectly preserve a single outlier.

One solution is to clip outliers: choose $\alpha$ smaller than the true maximum, accepting that extreme values will be clipped. This trades outlier accuracy for better precision on the bulk of values. For example, if we set $\alpha = 0.5$ instead of 5.0, our scale becomes $s = 0.5/127 \approx 0.00394$, and a value of 0.1 now quantizes to integer 25 with reconstruction $25 \times 0.00394 = 0.0985$, a much smaller relative error. The outlier would be clipped to 127, reconstructing to 0.5 instead of 5.0, but this might be an acceptable trade-off depending on the model's sensitivity.

Finding the optimal clipping threshold is the basis for calibration-based quantization methods. These methods analyze the value distribution across representative data to find the clipping point that minimizes total reconstruction error, balancing the clipping errors from outliers against the rounding errors for typical values.
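
To see what such a calibration search might look like, here is a minimal sketch (using an illustrative synthetic tensor, not data from this chapter) that sweeps candidate clipping thresholds and keeps the one minimizing mean squared reconstruction error:

Code
import torch

torch.manual_seed(0)

# Illustrative tensor: mostly small values plus a single large outlier
x = torch.randn(10_000) * 0.05
x[0] = 5.0


def clipped_absmax_error(x: torch.Tensor, alpha: float) -> float:
    """Quantize with clipping threshold alpha and return the reconstruction MSE."""
    scale = alpha / 127.0
    x_q = torch.round(x / scale).clamp(-128, 127)  # values beyond alpha get clipped
    x_hat = x_q * scale
    return torch.mean((x - x_hat) ** 2).item()


# Sweep thresholds from a small fraction of the absmax up to the full absmax
absmax = x.abs().max().item()
candidates = torch.linspace(0.05, 1.0, 20) * absmax
errors = [clipped_absmax_error(x, a.item()) for a in candidates]
best = candidates[int(torch.tensor(errors).argmin())].item()
print(f"Best clipping threshold ~ {best:.3f} (full absmax = {absmax:.3f})")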

Out[3]:
Visualization
Histograms of quantized values demonstrating the clipping trade-off. The left panel shows that using the full range (absmax) compresses the bulk of data into few bins due to outliers. The right panel shows that clipping outliers allows the majority of data to use more bins, improving precision for typical values.

Asymmetric Quantization

When tensor values aren't centered around zero, symmetric quantization wastes representable range. ReLU activations, for example, are always non-negative. Mapping the range $[0, 3.5]$ symmetrically would waste half our integers representing negative values that never occur. This inefficiency becomes severe when the actual data distribution is heavily skewed or shifted away from zero.

Asymmetric quantization introduces a zero-point offset $z$ to shift the quantization range:

$$
\begin{aligned}
s &= \frac{\max(\mathbf{x}) - \min(\mathbf{x})}{255} \\
z &= -128 - \text{round}\left(\frac{\min(\mathbf{x})}{s}\right) \\
\mathbf{x}_q &= \text{round}\left(\frac{\mathbf{x}}{s}\right) + z
\end{aligned}
$$

where:

  • $\mathbf{x}$: the input floating-point tensor
  • $s$: the scale factor, computed from the full range of values
  • $255$: the number of intervals in an 8-bit range ($2^8 - 1$)
  • $z$: the zero-point integer that shifts the grid to align with the data distribution
  • $\mathbf{x}_q$: the resulting quantized tensor

The key difference from symmetric quantization lies in how we compute the scale and the introduction of the zero-point. Instead of basing the scale on the maximum absolute value, we use the full range from minimum to maximum, ensuring all 256 integers can potentially be used. The zero-point $z$ is an integer that shifts the quantization grid so that the minimum value maps to -128 and the maximum to 127. This shift allows us to "slide" our quantization grid along the number line to wherever the actual data lies.

Dequantization becomes:

$$
\hat{\mathbf{x}} = s \cdot (\mathbf{x}_q - z)
$$

where:

  • $\hat{\mathbf{x}}$: the reconstructed floating-point values
  • $s$: the scale factor
  • $\mathbf{x}_q$: the quantized integer values
  • $z$: the zero-point offset, which is subtracted before scaling

While asymmetric quantization can represent arbitrary ranges more efficiently, it complicates the math during inference. Matrix multiplications now involve additional terms from the zero-point offsets. When we expand the matrix multiplication with asymmetric quantization, cross-terms involving the zero-points appear, requiring additional computation and memory access. For weights, which are typically centered around zero due to common initialization and regularization practices, symmetric quantization remains the standard choice. The computational overhead of asymmetric quantization is generally reserved for activations where the efficiency gains from better range utilization outweigh the additional complexity.
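
As a concrete illustration, here is a minimal sketch of asymmetric quantization following the formulas above; the function names are illustrative rather than taken from any particular library.

Code
import torch


def asymmetric_quantize(x: torch.Tensor) -> tuple[torch.Tensor, float, int]:
    """Quantize to INT8 using an asymmetric (zero-point) scheme."""
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / 255.0 if x_max != x_min else 1.0
    zero_point = int(-128 - round(x_min / scale))
    quantized = (torch.round(x / scale) + zero_point).clamp(-128, 127).to(torch.int8)
    return quantized, scale, zero_point


def asymmetric_dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Recover a floating-point approximation by undoing the shift, then scaling."""
    return (q.float() - zero_point) * scale


# ReLU-style non-negative activations now use the full integer range
relu_acts = torch.rand(1000) * 3.5
q, s, z = asymmetric_quantize(relu_acts)
recon = asymmetric_dequantize(q, s, z)
print(f"scale={s:.5f}, zero_point={z}, max error={(relu_acts - recon).abs().max().item():.5f}")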

Out[4]:
Visualization
Symmetric versus asymmetric quantization for non-negative ReLU activations. Symmetric quantization (left) leaves the negative range unused, wasting capacity. Asymmetric quantization (right) shifts the zero-point to align the grid with the data, utilizing the full integer range for representable values.

Per-Tensor vs Per-Channel Quantization

The granularity at which we compute scale factors significantly impacts quantization quality. This design choice represents a trade-off between simplicity and accuracy, with different granularities being appropriate for different situations.

Per-tensor quantization uses a single scale factor for an entire weight matrix or activation tensor. This is simple and introduces minimal overhead, but a single outlier anywhere in the tensor degrades precision everywhere. Imagine a weight matrix where one row has unusually large values: that single row would force a large scale factor that reduces precision for all other rows.

Per-channel quantization computes separate scale factors for each output channel of a weight matrix. For a linear layer with weight matrix $\mathbf{W} \in \mathbb{R}^{d_{in} \times d_{out}}$, we compute $d_{out}$ different scale factors, one for each column (corresponding to each output channel):

$$
s_j = \frac{\max(|\mathbf{W}_{:,j}|)}{127}
$$

where:

  • $s_j$: the scale factor for the $j$-th output channel (column)
  • $\mathbf{W}_{:,j}$: the vector of weights corresponding to the $j$-th output channel
  • $\max(|\mathbf{W}_{:,j}|)$: the maximum absolute value in that channel's weights
  • $127$: the maximum value for a signed 8-bit integer

This allows each output channel to use its full INT8 range, accommodating the fact that different channels often have different magnitude distributions. Consider why this happens: during training, different output features may learn patterns of varying intensity, leading some channels to develop larger weight magnitudes than others. Per-channel quantization respects these differences by giving each channel its own appropriately-sized quantization grid.

The overhead is storing $d_{out}$ scale factors instead of one, which is negligible compared to the weight tensor itself. For a layer with 4096 output channels and 4096 input features, we store 4096 scale factors (one per channel) versus over 16 million quantized weights. The scale factors add less than 0.1% to the storage requirements while potentially dramatically improving accuracy.
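
Putting a rough number on that overhead (assuming, for illustration, FP16 storage for the scale factors and INT8 storage for the weights):

$$
\frac{4096 \text{ scales} \times 2 \text{ bytes}}{4096 \times 4096 \text{ weights} \times 1 \text{ byte}} = \frac{8\,\text{KB}}{16\,\text{MB}} \approx 0.05\%
$$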

Per-channel quantization for weights combined with per-tensor quantization for activations has emerged as the standard approach, offering good accuracy with manageable complexity. The asymmetry makes sense: weights are static and can be analyzed offline to compute optimal per-channel scales, while activations vary with each input and would require expensive runtime scale computation for per-channel treatment.

The Outlier Problem in Large Language Models

As language models scale to billions of parameters, a problematic pattern emerges: certain activation dimensions develop extreme outlier values. Research has shown that in models like OPT-175B and BLOOM-176B, a small number of hidden dimensions (sometimes called "massive activations" or "emergent features") can have values 10-100x larger than typical activations. This phenomenon was not anticipated by early quantization research and poses a significant challenge to naive INT8 approaches.

The emergence of these outliers appears to be linked to how transformers process information at scale. Certain hidden dimensions seem to serve as "highways" for important information, developing consistently large activation magnitudes across many inputs. These are not random fluctuations but systematic features of the model's learned representations. While we are still working to fully understand why this happens, the practical implication is clear: any quantization strategy for large language models must account for these outliers.

These outliers create a severe quantization challenge. Consider a hidden state where 99.9% of values lie in $[-1, 1]$, but dimension 1847 consistently produces values around 50. Per-tensor absmax quantization would set $s = 50/127 \approx 0.39$, meaning values of magnitude 1 quantize to just 2-3 integers. The reconstruction error for these common values becomes unacceptable. A value of 0.5, which might be crucial for the model's computation, would quantize to integer 1 and reconstruct to roughly 0.39, an error of about 20% in its representation.

The naive solutions all have drawbacks:

  • Clipping outliers: Destroys important model information encoded in those dimensions
  • Per-channel activation quantization: Requires computing new scale factors for every token, adding latency
  • Mixed precision: Keeping outlier channels in FP16 complicates kernels and reduces throughput

Each of these approaches involves either losing accuracy, losing speed, or adding significant implementation complexity. What we needed was a technique that could tame the outlier problem without sacrificing the efficiency benefits of uniform INT8 quantization.

Smooth quantization offers a practical mathematical solution.

Smooth Quantization

The key insight behind smooth quantization is that while activations are hard to quantize (dynamic, contain outliers), weights are easy to quantize (static, relatively uniform). We can mathematically migrate the quantization difficulty from activations to weights through a channel-wise scaling transformation. This insight transforms the problem from one that seems intractable to one that has a clean solution.

The intuition is straightforward: if one matrix is "spiky" with outliers and another is "smooth" and well-behaved, we can transfer some of the spikiness from the first to the second. As long as we do this in a mathematically consistent way that preserves the final computation, we haven't changed the model's behavior, only redistributed the quantization difficulty.

The Mathematical Equivalence

Consider a linear layer computing $\mathbf{Y} = \mathbf{X}\mathbf{W}$, where $\mathbf{X} \in \mathbb{R}^{n \times d_{in}}$ is the activation input and $\mathbf{W} \in \mathbb{R}^{d_{in} \times d_{out}}$ is the weight matrix. We introduce a diagonal matrix $\mathbf{S} = \text{diag}(s_1, s_2, \ldots, s_{d_{in}})$ with positive scaling factors:

$$
\begin{aligned}
\mathbf{Y} &= \mathbf{X}\mathbf{W} \\
&= (\mathbf{X}\mathbf{S}^{-1})(\mathbf{S}\mathbf{W}) && \text{(insert identity } \mathbf{S}^{-1}\mathbf{S} = \mathbf{I} \text{)} \\
&= \hat{\mathbf{X}}\hat{\mathbf{W}} && \text{(regroup terms)}
\end{aligned}
$$

where:

  • $\mathbf{Y}$: the output of the linear layer
  • $\mathbf{X}$: the input activation tensor
  • $\mathbf{W}$: the weight matrix
  • $\mathbf{S}$: a diagonal smoothing matrix containing scaling factors $s_j$ for each input dimension
  • $\hat{\mathbf{X}}$: the smoothed activation matrix ($\mathbf{X}\mathbf{S}^{-1}$)
  • $\hat{\mathbf{W}}$: the adjusted weight matrix ($\mathbf{S}\mathbf{W}$)

This transformation is mathematically exact since $\mathbf{S}^{-1}\mathbf{S} = \mathbf{I}$. We haven't changed the computation, just how we factor it. The key mathematical property we exploit is that multiplying by a matrix and then its inverse produces the identity, so inserting $\mathbf{S}^{-1}\mathbf{S}$ between $\mathbf{X}$ and $\mathbf{W}$ changes nothing about the final result. We are free to group these factors however we like, and we choose to group $\mathbf{S}^{-1}$ with the activations and $\mathbf{S}$ with the weights.

The transformation affects the data as follows:

  • Smoothed activations: $\hat{\mathbf{X}} = \mathbf{X}\mathbf{S}^{-1}$ divides each column $j$ of $\mathbf{X}$ by $s_j$
  • Adjusted weights: $\hat{\mathbf{W}} = \mathbf{S}\mathbf{W}$ multiplies each row $j$ of $\mathbf{W}$ by $s_j$

If channel $j$ has outlier activations, we choose a large $s_j$. This shrinks the activations (making them easier to quantize) while enlarging the corresponding weights (which are already easy to quantize). We're transferring the outlier problem to a domain where it's more manageable. Since the weights typically have plenty of "headroom" before they become difficult to quantize, they can absorb the increased magnitude from the outlier channels without significant accuracy loss.

Choosing the Smoothing Factors

How do we pick the $s_j$ values? If we make $s_j$ too large, we might shrink the activations so much that we lose precision in other ways, or we might inflate the weights to the point where they become hard to quantize. The optimal choice balances these competing concerns.

The SmoothQuant paper proposes balancing the quantization difficulty between activations and weights using a migration strength parameter $\alpha \in [0, 1]$:

$$
s_j = \frac{\max(|\mathbf{X}_{:,j}|)^\alpha}{\max(|\mathbf{W}_{j,:}|)^{1-\alpha}}
$$

where:

  • $s_j$: the smoothing factor for the $j$-th input channel
  • $\max(|\mathbf{X}_{:,j}|)$: the maximum absolute value in the $j$-th column of activations (across the calibration set)
  • $\max(|\mathbf{W}_{j,:}|)$: the maximum absolute value in the $j$-th row of weights (corresponding to the $j$-th input feature)
  • $\alpha$: the migration strength hyperparameter controlling how much difficulty is shifted from activations to weights

This formula balances two competing goals:

  • When $\alpha = 1$: $s_j = \max(|\mathbf{X}_{:,j}|)$, fully smoothing activations but potentially making weights hard to quantize
  • When $\alpha = 0$: $s_j = 1/\max(|\mathbf{W}_{j,:}|)$, which normalizes the weight rows and pushes all the quantization difficulty onto the activations
  • When $\alpha = 0.5$: Balances the maximum values of both smoothed activations and adjusted weights

To understand this formula intuitively, consider what happens at $\alpha = 0.5$. The smoothing factor becomes the geometric mean of the activation maximum and the reciprocal of the weight maximum. This choice ensures that after transformation, the maximum activation and maximum weight for each channel have roughly equal magnitude, spreading the quantization difficulty evenly between the two matrices.
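
Writing this out for $\alpha = 0.5$ makes the balance explicit: after the transformation, the per-channel maxima of the smoothed activations and the adjusted weights both land at the geometric mean of the original maxima.

$$
\begin{aligned}
\max(|\hat{\mathbf{X}}_{:,j}|) &= \frac{\max(|\mathbf{X}_{:,j}|)}{s_j} = \sqrt{\max(|\mathbf{X}_{:,j}|)\,\max(|\mathbf{W}_{j,:}|)} \\
\max(|\hat{\mathbf{W}}_{j,:}|) &= s_j \cdot \max(|\mathbf{W}_{j,:}|) = \sqrt{\max(|\mathbf{X}_{:,j}|)\,\max(|\mathbf{W}_{j,:}|)}
\end{aligned}
$$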

In practice, $\alpha = 0.5$ works well for most models, achieving roughly equal quantization ranges for both activations and weights after smoothing. However, some models may benefit from different values. Models with particularly severe activation outliers might use $\alpha > 0.5$ to more aggressively smooth the activations, while models with already-difficult-to-quantize weights might use $\alpha < 0.5$ to avoid making the weights worse.

Out[5]:
Visualization
Effect of migration strength $\alpha$ on activation and weight magnitudes in SmoothQuant. As $\alpha$ increases, activation magnitudes decrease while weight magnitudes increase. The balanced point at $\alpha=0.5$ distributes quantization difficulty evenly between activations and weights.

Offline Computation

A critical practical advantage of smooth quantization is that the transformation can be computed offline during a calibration phase:

  1. Run a small calibration dataset through the model to collect activation statistics
  2. Compute $\max(|\mathbf{X}_{:,j}|)$ for each channel using these statistics
  3. Calculate smoothing factors $s_j$ for each layer
  4. Apply the weight transformation $\hat{\mathbf{W}} = \mathbf{S}\mathbf{W}$ and store the modified weights
  5. Fuse $\mathbf{S}^{-1}$ into the preceding layer's weights or normalization parameters

Step 5 is crucial: rather than explicitly dividing activations by $\mathbf{S}$ at runtime, we absorb this division into the bias of the previous layer or the scale/shift parameters of the preceding LayerNorm. This means inference has zero additional cost compared to standard quantized inference.

The fusion works because transformer architectures typically have a LayerNorm immediately before each linear layer. LayerNorm applies a learned scale and shift to each channel, and we can simply incorporate the smoothing factor into these existing parameters. If the original LayerNorm scale for channel $j$ was $\gamma_j$ and shift was $\beta_j$, we replace them with $\gamma_j/s_j$ and $\beta_j/s_j$. The activations coming out of the modified LayerNorm are already "smoothed," and we never need to perform the division at runtime.
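
A minimal sketch of this fusion, assuming a standard torch.nn.LayerNorm directly feeding a torch.nn.Linear (the helper name and toy shapes are illustrative, not from the text):

Code
import torch


@torch.no_grad()
def fuse_smoothing_into_layernorm(
    layernorm: torch.nn.LayerNorm,
    linear: torch.nn.Linear,
    smooth_scales: torch.Tensor,  # (in_features,), the per-channel s_j
) -> None:
    """Fold the division by s_j into the LayerNorm parameters and the
    multiplication by s_j into the linear layer's weights, in place."""
    # LayerNorm output is gamma * norm(x) + beta; dividing it by s_j is the
    # same as using gamma / s_j and beta / s_j.
    layernorm.weight.div_(smooth_scales)
    layernorm.bias.div_(smooth_scales)
    # Linear weight has shape (out_features, in_features); multiplying the
    # column for input channel j by s_j compensates exactly.
    linear.weight.mul_(smooth_scales.unsqueeze(0))


# Usage on toy modules: the fused model computes the same output
ln = torch.nn.LayerNorm(128)
fc = torch.nn.Linear(128, 64)
x = torch.randn(4, 128)
ref = fc(ln(x))                      # original computation
s = torch.rand(128) + 0.5            # illustrative smoothing factors
fuse_smoothing_into_layernorm(ln, fc, s)
fused = fc(ln(x))                    # same result, smoothing now "free"
print(torch.allclose(ref, fused, atol=1e-5))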

Quantized Matrix Multiplication

Understanding how quantized operations execute helps clarify why INT8 provides speedup. When both weights and activations are in INT8, the matrix multiplication proceeds as:

$$
\mathbf{Y}_{\text{int32}} = \mathbf{X}_{\text{int8}} \mathbf{W}_{\text{int8}}
$$

where:

  • $\mathbf{Y}_{\text{int32}}$: the intermediate result matrix stored in 32-bit integers
  • $\mathbf{X}_{\text{int8}}$: the quantized input activations (8-bit integers)
  • $\mathbf{W}_{\text{int8}}$: the quantized weights (8-bit integers)

The integer multiplication produces INT32 results to avoid overflow from accumulating many INT8×INT8 products. To understand why this is necessary, consider that each INT8×INT8 multiplication produces a result up to $127 \times 127 = 16129$, which still fits in INT16. However, a matrix multiplication accumulates thousands or millions of such products, and these sums can easily exceed the INT16 range. INT32 provides enough headroom to accumulate the typical number of products found in neural network layers without overflow.
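
A rough worst-case bound makes the headroom concrete. For a dot product of length $k$, taking $k = 4096$ as a representative hidden size and the full INT8 magnitude of 128 as a conservative per-operand bound:

$$
|y| \le k \cdot 128 \cdot 128 = 4096 \times 16384 \approx 6.7 \times 10^{7} \ll 2^{31} - 1 \approx 2.1 \times 10^{9}
$$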

We then convert back to floating-point and apply scale factors:

$$
\mathbf{Y}_{\text{fp}} = s_X \cdot s_W \cdot \mathbf{Y}_{\text{int32}}
$$

where:

  • $\mathbf{Y}_{\text{fp}}$: the final floating-point output
  • $s_X$: the quantization scale factor for the activations
  • $s_W$: the quantization scale factor for the weights (scalar or vector depending on quantization granularity)
  • $\mathbf{Y}_{\text{int32}}$: the integer output from the matrix multiplication, converted to float before scaling

The scale factors $s_X$ and $s_W$ are the quantization scales for activations and weights respectively. Since scale factors are scalar (or per-channel), this final rescaling is cheap compared to the matrix multiplication itself. The computational cost of the rescaling step is $O(n \cdot m)$ for an output matrix of size $n \times m$, while the matrix multiplication itself is $O(n \cdot m \cdot k)$ for matrices of appropriate dimensions. Since $k$ is typically large (thousands of features), the rescaling overhead is negligible.
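
The following sketch simulates this execution pattern, using NumPy so the INT32 accumulation dtype is explicit; a real deployment would call a dedicated INT8 GEMM kernel instead, and the shapes here are arbitrary illustrations.

Code
import numpy as np

rng = np.random.default_rng(0)

# FP32 reference operands
x_fp = rng.standard_normal((16, 256)).astype(np.float32)
w_fp = (rng.standard_normal((256, 64)) * 0.1).astype(np.float32)

# Per-tensor symmetric (absmax) quantization of both operands
s_x = np.abs(x_fp).max() / 127.0
s_w = np.abs(w_fp).max() / 127.0
x_q = np.clip(np.round(x_fp / s_x), -128, 127).astype(np.int8)
w_q = np.clip(np.round(w_fp / s_w), -128, 127).astype(np.int8)

# INT8 x INT8 products accumulated in INT32, as the hardware would do
y_int32 = x_q.astype(np.int32) @ w_q.astype(np.int32)

# Rescale back to floating point with the product of the two scales
y_fp = y_int32.astype(np.float32) * (s_x * s_w)

ref = x_fp @ w_fp
rel_err = np.abs(y_fp - ref).mean() / np.abs(ref).mean()
print(f"mean relative error: {rel_err:.4f}")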

Hardware INT8 units execute this pattern efficiently because:

  • INT8 multiply-accumulate operations are simpler than floating-point
  • Smaller operands allow higher parallelism in the same silicon area
  • Memory bandwidth is halved, reducing the primary bottleneck for large models

The silicon area argument is particularly important. A floating-point multiplier requires circuits for handling sign bits, exponents, and mantissas, with additional logic for normalization and special cases. An integer multiplier needs only straightforward binary multiplication. This simplicity means that in the same chip area, hardware designers can fit more INT8 units than floating-point units, enabling higher parallelism. Combined with the reduced memory bandwidth from smaller data types, INT8 quantization provides compounding benefits that translate to real-world speedups.

Implementation

Let's implement INT8 quantization from scratch to solidify these concepts.

In[6]:
Code
import torch
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Absmax Quantization

We'll start with the fundamental absmax quantization and dequantization operations:

In[7]:
Code
def absmax_quantize(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """
    Quantize a tensor to INT8 using absmax symmetric quantization.

    Returns:
        quantized: INT8 tensor
        scale: Scale factor for dequantization
    """
    # Find the maximum absolute value
    absmax = x.abs().max().item()

    # Compute scale factor (avoid division by zero)
    scale = absmax / 127.0 if absmax != 0 else 1.0

    # Quantize: divide by scale and round to nearest integer
    quantized = torch.round(x / scale).clamp(-128, 127).to(torch.int8)

    return quantized, scale


def dequantize(quantized: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover floating-point approximation from quantized tensor."""
    return quantized.float() * scale

Let's test this on a simple tensor:

In[8]:
Code
# Create a sample weight tensor
weights = torch.randn(4, 4) * 0.5

# Quantize
quantized, scale = absmax_quantize(weights)

# Dequantize
reconstructed = dequantize(quantized, scale)

# Compute error
error = (weights - reconstructed).abs()
Out[9]:
Console
Original weights:
[[ 0.9635  0.7436  0.4504 -1.0528]
 [ 0.3392 -0.6173 -0.0215 -0.8023]
 [-0.3761  0.8244 -0.1962 -0.7018]
 [-0.3639 -0.2797 -0.3844  0.3812]]

Scale factor: 0.008289

Quantized (INT8):
[[ 116   90   54 -127]
 [  41  -74   -3  -97]
 [ -45   99  -24  -85]
 [ -44  -34  -46   46]]

Reconstructed:
[[ 0.9616  0.7461  0.4476 -1.0528]
 [ 0.3399 -0.6134 -0.0249 -0.8041]
 [-0.373   0.8207 -0.1989 -0.7046]
 [-0.3647 -0.2818 -0.3813  0.3813]]

Absolute error:
[[1.881e-03 2.409e-03 2.728e-03 0.000e+00]
 [6.580e-04 3.853e-03 3.335e-03 1.744e-03]
 [3.043e-03 3.705e-03 2.708e-03 2.800e-03]
 [7.950e-04 2.127e-03 3.105e-03 9.200e-05]]

Max error: 0.003853
Mean error: 0.002186

The maximum error is bounded by half the quantization step size ($s/2$), as values are rounded to the nearest integer. For this tensor the scale is about 0.0083, so the maximum possible error is about 0.004, consistent with the observed maximum error of 0.0039.

Analyzing Quantization Error

Let's visualize how quantization error varies across different value ranges:

In[10]:
Code
# Generate values across a range
original_values = torch.linspace(-1.5, 1.5, 1000)

# Quantize
scale = 1.5 / 127
quantized_vals = torch.round(original_values / scale).clamp(-128, 127)
reconstructed_vals = quantized_vals * scale

# Compute errors
errors = original_values - reconstructed_vals
Out[11]:
Visualization
Quantization error analysis for a uniform grid. The scatter plot reveals the characteristic sawtooth pattern where error is bounded by $\pm s/2$. The histogram confirms that errors are uniformly distributed within this bound, minimizing bias.

The error follows a uniform distribution between $-s/2$ and $+s/2$, which is exactly what we expect from rounding to the nearest integer on a uniform grid.

Per-Channel Quantization

Now let's implement per-channel quantization for weight matrices:

In[12]:
Code
def per_channel_quantize(
    weight: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Quantize weight matrix with per-output-channel scales.

    Args:
        weight: Shape (out_features, in_features)

    Returns:
        quantized: INT8 weight matrix
        scales: Scale factor for each output channel
    """
    # Compute absmax for each row (output channel)
    absmax_per_channel = weight.abs().max(dim=1).values

    # Compute scales (avoid division by zero)
    scales = absmax_per_channel / 127.0
    scales = torch.where(scales == 0, torch.ones_like(scales), scales)

    # Quantize each row by its scale
    quantized = (
        torch.round(weight / scales.unsqueeze(1))
        .clamp(-128, 127)
        .to(torch.int8)
    )

    return quantized, scales


def per_channel_dequantize(
    quantized: torch.Tensor, scales: torch.Tensor
) -> torch.Tensor:
    """Dequantize using per-channel scales."""
    return quantized.float() * scales.unsqueeze(1)

Let's compare per-tensor vs per-channel quantization on a weight matrix with varying row magnitudes:

In[13]:
Code
# Create weight matrix where different rows have different magnitudes
# This simulates real neural network weights
weights_varied = torch.randn(8, 16)
# Artificially scale different rows
row_scales = torch.tensor([0.1, 0.2, 0.5, 1.0, 2.0, 0.3, 0.8, 0.15])
weights_varied = weights_varied * row_scales.unsqueeze(1)

# Per-tensor quantization
qt_tensor, scale_tensor = absmax_quantize(weights_varied)
recon_tensor = dequantize(qt_tensor, scale_tensor)
error_tensor = (weights_varied - recon_tensor).abs()

# Per-channel quantization
qt_channel, scales_channel = per_channel_quantize(weights_varied)
recon_channel = per_channel_dequantize(qt_channel, scales_channel)
error_channel = (weights_varied - recon_channel).abs()
Out[14]:
Console
Row magnitude scales: [0.1  0.2  0.5  1.   2.   0.3  0.8  0.15]

Per-tensor scale: 0.0271
Per-channel scales: [0.0013 0.0033 0.0073 0.0175 0.0271 0.0059 0.0092 0.0023]

Per-tensor max error:     0.013417
Per-channel max error:    0.013251

Per-tensor mean error:    0.006463
Per-channel mean error:   0.002101

Mean error per row:
Row | Magnitude | Per-tensor | Per-channel
---------------------------------------------
  0 |    0.10    |   0.004851  |   0.000280
  1 |    0.20    |   0.009060  |   0.000748
  2 |    0.50    |   0.005872  |   0.002023
  3 |    1.00    |   0.007094  |   0.003080
  4 |    2.00    |   0.006667  |   0.006667
  5 |    0.30    |   0.005832  |   0.001644
  6 |    0.80    |   0.006436  |   0.001909
  7 |    0.15    |   0.005891  |   0.000461

Per-channel quantization achieves substantially lower error, especially for rows with small magnitudes. The per-tensor approach is forced to use a scale dictated by the largest row, wasting precision for smaller rows.

Out[15]:
Visualization
Mean absolute quantization error across weight rows with varying magnitudes. Per-channel quantization (orange) adapts the scale to each row, significantly reducing error for small-magnitude rows compared to per-tensor quantization (blue), which is dominated by the largest row.

Demonstrating the Outlier Problem

Let's simulate the outlier pattern seen in large language models:

In[16]:
Code
# Simulate activation tensor with outlier channel
batch_size, hidden_dim = 32, 128
activations = torch.randn(batch_size, hidden_dim)

# Inject outliers in channel 50 (simulating LLM behavior)
outlier_channel = 50
activations[:, outlier_channel] *= 20  # 20x larger than typical

# Per-tensor quantization
qt_act, scale_act = absmax_quantize(activations)
recon_act = dequantize(qt_act, scale_act)
error_act = (activations - recon_act).abs()

# Separate error for outlier vs normal channels
normal_mask = torch.ones(hidden_dim, dtype=torch.bool)
normal_mask[outlier_channel] = False

error_normal = error_act[:, normal_mask].mean().item()
error_outlier = error_act[:, outlier_channel].mean().item()
Out[17]:
Console
Activation statistics:
  Normal channels max:  3.83
  Outlier channel max:  38.07

Quantization scale: 0.2998

Mean absolute error:
  Normal channels:  0.075173
  Outlier channel:  0.063882

Relative error (error/value):
  Normal channels:  9.4%
  Outlier channel:  0.4%

The outlier channel forces a large scale factor, causing severe relative error for normal channels. This is exactly the problem that smooth quantization solves.

Out[18]:
Visualization
The outlier activation problem in large language models. The bar chart (left) identifies channel 50 as having extreme magnitude compared to others. The histogram (right) shows how this single outlier expands the range, compressing the representation of the majority of normal values.

Implementing Smooth Quantization

Now let's implement the smooth quantization transformation:

In[19]:
Code
def compute_smooth_scales(
    activation_absmax: torch.Tensor,  # Per-channel activation max
    weight_absmax: torch.Tensor,  # Per-channel weight max
    alpha: float = 0.5,
) -> torch.Tensor:
    """
    Compute smoothing scale factors.

    Args:
        activation_absmax: Max absolute activation for each input channel
        weight_absmax: Max absolute weight for each input channel (row of W)
        alpha: Migration strength (0=no smoothing, 1=full smoothing)

    Returns:
        scales: Smoothing factors for each channel
    """
    # s_j = (act_max^alpha) / (weight_max^(1-alpha))
    scales = (activation_absmax**alpha) / (weight_absmax ** (1 - alpha))

    # Clamp to avoid extreme values
    scales = scales.clamp(min=1e-5)

    return scales


def apply_smooth_transform(
    activations: torch.Tensor,  # (batch, in_features)
    weights: torch.Tensor,  # (in_features, out_features)
    scales: torch.Tensor,  # (in_features,)
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Apply smoothing transformation to activations and weights.

    Returns:
        smoothed_activations: X @ diag(1/s)
        smoothed_weights: diag(s) @ W
    """
    # Divide activations by scales (broadcast across batch)
    smoothed_activations = activations / scales.unsqueeze(0)

    # Multiply weight rows by scales
    smoothed_weights = weights * scales.unsqueeze(1)

    return smoothed_activations, smoothed_weights

Let's apply smooth quantization to our outlier problem:

In[20]:
Code
# Create weight matrix for the same linear layer
weights_for_layer = torch.randn(hidden_dim, 64) * 0.1  # (in, out)

# Compute channel-wise statistics
act_absmax = activations.abs().max(dim=0).values  # (hidden_dim,)
weight_absmax = weights_for_layer.abs().max(dim=1).values  # (hidden_dim,)

# Compute smooth scales with alpha=0.5
smooth_scales = compute_smooth_scales(act_absmax, weight_absmax, alpha=0.5)

# Apply transformation
smoothed_act, smoothed_weights = apply_smooth_transform(
    activations, weights_for_layer, smooth_scales
)

# Now quantize the smoothed tensors
qt_smooth_act, scale_smooth_act = absmax_quantize(smoothed_act)
recon_smooth_act = dequantize(qt_smooth_act, scale_smooth_act)
error_smooth_act = (smoothed_act - recon_smooth_act).abs()

# Transform error back to original space for comparison
# (multiply back by scales to compare apples-to-apples)
error_smooth_original = error_smooth_act * smooth_scales.unsqueeze(0)
Out[21]:
Console
Before smoothing:
  Activation max (normal): 3.83
  Activation max (outlier): 38.07
  Ratio: 9.9x

After smoothing:
  Activation max (normal): 1.08
  Activation max (outlier): 3.30
  Ratio: 3.1x

Smooth scale for outlier channel: 11.52

Quantization error (in original space):
  Without smoothing:
    Normal channels:  0.075173
    Outlier channel:  0.063882
  With smoothing:
    Normal channels:  0.019993
    Outlier channel:  0.063882

Smooth quantization dramatically reduces the activation range ratio, making per-tensor quantization much more effective. The error for normal channels drops significantly because they're no longer penalized by the outlier's presence.

Visualizing the Smoothing Effect

Out[22]:
Visualization
Impact of smooth quantization on activation magnitudes. Before smoothing (left), the outlier channel dominates the range. After smoothing (right), the outlier is suppressed and magnitudes are equilibrated across channels, facilitating effective per-tensor quantization.

End-to-End Quantized Linear Layer

Let's put everything together into a complete quantized linear layer:

In[23]:
Code
class QuantizedLinear:
    """
    INT8 quantized linear layer with optional smooth quantization.
    """

    def __init__(
        self,
        weight: torch.Tensor,  # Original FP32 weights (out, in)
        bias: torch.Tensor = None,
        smooth_scales: torch.Tensor = None,  # If provided, applies smoothing
    ):
        self.bias = bias
        self.smooth_scales = smooth_scales

        # If smoothing, transform weights
        if smooth_scales is not None:
            weight = weight * smooth_scales.unsqueeze(0)  # (out, in) * (in,)

        # Quantize weights (per-channel)
        self.weight_quantized, self.weight_scales = per_channel_quantize(weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass with INT8 computation.

        In real implementations, this would use INT8 GEMM kernels.
        Here we simulate the numerical behavior.
        """
        # Apply activation smoothing if configured
        if self.smooth_scales is not None:
            x = x / self.smooth_scales.unsqueeze(0)

        # Quantize activations (per-tensor)
        x_quantized, x_scale = absmax_quantize(x)

        # Simulate INT8 matrix multiplication
        # In hardware: INT8 × INT8 → INT32, accumulated
        # Here we do it in float but with quantized values
        output_int = x_quantized.float() @ self.weight_quantized.float().T

        # Dequantize: multiply by both scales
        output = output_int * x_scale * self.weight_scales.unsqueeze(0)

        if self.bias is not None:
            output = output + self.bias

        return output

Now let's compare the quantized layer against full precision:

In[24]:
Code
# Create a test linear layer
in_features, out_features = 128, 64
weight = torch.randn(out_features, in_features) * 0.1
bias = torch.randn(out_features) * 0.01

# Test input (with outlier)
test_input = torch.randn(16, in_features)
test_input[:, 50] *= 20  # Inject outlier

# Full precision output
fp_output = test_input @ weight.T + bias

# Quantized without smoothing
quant_layer = QuantizedLinear(weight, bias, smooth_scales=None)
quant_output = quant_layer.forward(test_input)

# Compute smooth scales from test input
act_max = test_input.abs().max(dim=0).values
weight_max = weight.abs().max(dim=0).values
smooth_s = compute_smooth_scales(act_max, weight_max, alpha=0.5)

# Quantized with smoothing
smooth_quant_layer = QuantizedLinear(weight, bias, smooth_scales=smooth_s)
smooth_quant_output = smooth_quant_layer.forward(test_input)

# Calculate errors for comparison
error_quant = (fp_output - quant_output).abs()
error_smooth = (fp_output - smooth_quant_output).abs()
Out[25]:
Console
Linear layer output comparison:
  Output shape: torch.Size([16, 64])

  Without smoothing:
    Max error:  0.224213
    Mean error: 0.049900
    Relative error: 3.33%

  With smooth quantization:
    Max error:  0.070532
    Mean error: 0.017435
    Relative error: 1.16%

Smooth quantization reduces both maximum and mean error, making INT8 quantization viable even with outlier activations.

Out[26]:
Visualization
Output fidelity comparison between standard and smooth INT8 quantization. The scatter plot shows smooth quantization (orange) tracking the ideal FP32 diagonal more closely than standard quantization (blue). The error distribution confirms that smooth quantization centers errors near zero with lower variance.

Key Parameters

The key parameters for the INT8 quantization implementation are:

  • alpha: The migration strength in smooth quantization (typically 0.5). Controls how much quantization difficulty is shifted from activations to weights.
  • scale: The step size of the quantization grid. Determined by the maximum absolute value (absmax) in the tensor or channel.
  • smooth_scales: The channel-wise scaling factors derived from calibration data to balance activation and weight magnitudes.

INT8 Accuracy in Practice

The accuracy impact of INT8 quantization varies significantly based on model architecture, size, and task. Here's what research and practice have shown:

Models under 1B parameters typically quantize to INT8 with minimal degradation (less than 1% accuracy loss on benchmarks). The weight distributions are well-behaved, and activation outliers are rare.

Models from 1B to 10B parameters show increasing sensitivity. Without techniques like smooth quantization, accuracy can degrade 2-5% on certain tasks. Perplexity increases are noticeable but often acceptable for deployment.

Models above 10B parameters exhibit the emergent outlier features that make naive INT8 quantization fail. Smooth quantization or similar techniques become essential, restoring accuracy to near-FP16 levels.

Task sensitivity also matters:

  • Simple classification tasks are robust to quantization
  • Open-ended generation shows more degradation as errors compound
  • Mathematical reasoning and code generation are particularly sensitive

The key insight is that INT8 quantization is not a one-size-fits-all solution. Calibration data choice, quantization granularity, and outlier handling all need tuning for optimal results.

Limitations and Practical Considerations

INT8 quantization represents a practical and widely-deployed optimization, but it comes with important limitations that you should understand.

The most fundamental limitation is that some model information is permanently lost during quantization. While smooth quantization and careful calibration minimize this loss, they cannot eliminate it entirely. For safety-critical applications or tasks requiring the highest possible accuracy, the small degradation from INT8 may be unacceptable. This is why many production systems maintain FP16 or FP32 models for evaluation and fine-tuning while deploying INT8 for inference.

Calibration data sensitivity presents another challenge. Smooth quantization computes channel-wise statistics from a calibration dataset, and if this data doesn't represent the actual deployment distribution, the smoothing factors may be suboptimal. A model calibrated on English text may perform poorly when quantized for code completion or multilingual inputs. You must ensure calibration data matches deployment conditions.

Hardware support, while widespread, is not universal. INT8 acceleration requires specific hardware features (like NVIDIA's Tensor Cores or ARM's NEON instructions), and the speedup varies by platform. On CPUs without INT8 vectorization, quantized inference may actually be slower than FP32 due to the quantization/dequantization overhead. Always benchmark on target hardware before committing to a quantization strategy.

Finally, INT8 may not provide sufficient compression for edge deployment. With 8 bits per weight, a 7B parameter model still requires 7GB of storage. For mobile or embedded applications, the more aggressive INT4 quantization techniques we'll explore in the next chapter become necessary, accepting greater accuracy trade-offs for smaller model sizes.

Summary

INT8 quantization provides a practical path to halving model memory and accelerating inference through hardware INT8 units. The key concepts are:

  • Absmax quantization maps floating-point values to INT8 using a single scale factor, centering the quantization grid at zero
  • Per-channel quantization applies separate scales to each output channel, reducing error when channels have different magnitude distributions
  • The outlier problem emerges in large language models where a few activation dimensions have values 10-100x larger than typical, making per-tensor quantization fail
  • Smooth quantization solves this by migrating quantization difficulty from activations to weights through a mathematically equivalent transformation
  • The smoothing factor $s_j$ balances activation and weight ranges, computed offline from calibration data and fused into preceding layers

INT8 quantization with smooth quantization achieves near-FP16 accuracy on models up to hundreds of billions of parameters while providing 2x memory reduction and significant throughput improvements on compatible hardware. For even more aggressive compression, the upcoming chapters on INT4 quantization, GPTQ, and AWQ explore techniques that push quantization to just 4 bits per weight.

