INT4 Quantization: Group-wise Methods & NF4 Format for LLMs

Michael Brenndoerfer · January 12, 2026 · 40 min read

Learn INT4 quantization techniques for LLMs. Covers group-wise quantization, NF4 format, double quantization, and practical implementation with bitsandbytes.


INT4 Quantization

Reducing weights from 16 bits to 8 bits cuts memory in half while preserving most model quality. The natural question is: can we go further? INT4 quantization promises to halve memory again, fitting a 70B parameter model into roughly 35GB instead of 70GB. But moving from 8 bits to 4 bits isn't just "more of the same." It introduces fundamental challenges that require new techniques to overcome.

As we discussed in the previous chapter on INT8 quantization, the basic idea of quantization is mapping continuous floating-point values to a discrete set of integers. INT8 gives us 256 possible values. INT4 gives us only 16. This dramatic reduction in representational capacity means that naive 4-bit quantization destroys model quality. The techniques we explore in this chapter, particularly group-wise quantization and specialized 4-bit formats, make aggressive quantization practical.

The Challenge of 4-Bit Precision

With only 4 bits, we can represent just 16 distinct values. To understand why this limitation is so severe, consider the fundamental mathematics at play. If we use signed integers, this gives us a range from -8 to 7 (or -7 to 7 with a symmetric scheme). Every weight in a neural network, regardless of its original floating-point precision, must map to one of these 16 values. This is a very coarse discretization of the original continuous space.

Consider what this means for a typical weight distribution. Neural network weights after training often follow an approximately normal distribution centered near zero. The majority of weights cluster around small magnitudes, with the distribution tapering off symmetrically toward the tails. With INT8, we have enough resolution to capture the shape of this distribution reasonably well, since 256 quantization levels can provide fine granularity even in the densely populated region near zero. With INT4, we're essentially creating a 16-bucket histogram to represent a continuous distribution. Each bucket must absorb a wide swath of the original weight values, collapsing potentially meaningful differences into a single quantized output.

When we quantize a value, we introduce a quantization error equal to the difference between the original value and its quantized representation. With 256 levels, the maximum error for any single weight is bounded by half the step size between adjacent levels. With only 16 levels, that step size becomes 16 times larger, and the maximum quantization error grows proportionally. This error propagates through every matrix multiplication in the network, accumulating and potentially compounding in ways that can fundamentally alter the model's behavior.
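
To make the step-size argument concrete, here is a small worked check (not part of the original notebook) comparing the bin width and worst-case rounding error when a symmetric range is carved into 256 versus 16 equal bins. The range of ±0.1 is an arbitrary illustrative choice.

import numpy as np

# Bin width and worst-case rounding error for a symmetric range [-r, r]
# split into 256 (INT8) versus 16 (INT4) equal bins.
r = 0.1  # assumed weight range for this example
widths = {}
for name, n_levels in [("INT8", 256), ("INT4", 16)]:
    width = 2 * r / n_levels  # width of each quantization bin
    widths[name] = width
    print(f"{name}: bin width = {width:.6f}, max rounding error = {width / 2:.6f}")

print(f"INT4 bins are {widths['INT4'] / widths['INT8']:.0f}x wider than INT8 bins")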

In[3]:
Code
import numpy as np

# Simulate typical neural network weight distribution
np.random.seed(42)
weights = np.random.randn(10000) * 0.02  # Typical small weight scale

# What INT8 and INT4 quantization "see"
int8_bins = 256
int4_bins = 16
Out[4]:
Visualization
Original FP32 weight distribution. The weights follow a normal distribution centered near zero, typical of neural network layers.
INT8 quantization with 256 levels. The 256 bins provide sufficient granularity to capture the shape of the distribution, including the tails.
INT4 quantization with 16 levels. The coarse 16-bin histogram fails to capture fine details, approximating the continuous distribution with a rough step function.

The loss of precision becomes more severe when we consider outliers. Recall from our INT8 discussion that neural networks often have a small number of outlier weights with magnitudes much larger than the majority. These outliers, though rare, can be critically important for the model's computations, often encoding strong feature responses or essential biases. With INT8, we can absorb some outlier impact because we have 256 levels to work with, providing reasonable resolution even after accommodating extreme values. With INT4, outliers become catastrophic because the limited number of levels cannot simultaneously span a wide range and maintain adequate precision.

The Outlier Problem Intensified

To build intuition for why outliers cause such severe problems at 4-bit precision, imagine a layer where 99% of weights fall between -0.1 and 0.1, but a few outliers reach values of 0.5 or higher. If we set our quantization range to cover the outliers, most of our 16 quantization levels will be "wasted" on the outlier range, leaving very few levels to distinguish the majority of weights near zero. This is a fundamental trade-off that becomes increasingly painful as the number of available levels decreases.

Consider the mathematics of this trade-off. Suppose our 16 quantization levels must span from -0.5 to 0.5 to accommodate outliers. The step size between adjacent levels is then 1.0 divided by 15 (since we have 16 levels creating 15 intervals), giving approximately 0.067 per step. For the 99% of weights living in the range of -0.1 to 0.1, this means they have access to only about 3 distinct quantization levels. The subtle differences between small weights, which are important for model computations, are lost.
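
A two-line check of this arithmetic (an illustrative sketch, not one of the notebook cells) confirms how little resolution remains for the densely populated region:

import numpy as np

# 16 uniform levels spanning [-0.5, 0.5]: how many quantization steps cover
# the region [-0.1, 0.1] where 99% of the weights live?
levels = np.linspace(-0.5, 0.5, 16)
step = levels[1] - levels[0]

print(f"Step between adjacent levels: {step:.4f}")        # ~0.067
print(f"Steps covering [-0.1, 0.1]:   {0.2 / step:.1f}")  # ~3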

In[5]:
Code
# Demonstrate outlier impact on INT4 quantization
np.random.seed(42)

# Normal weights with a few outliers
normal_weights = np.random.randn(1000) * 0.02
outliers = np.array([0.5, -0.4, 0.6, -0.5])  # Just 4 outliers
weights_with_outliers = np.concatenate([normal_weights, outliers])

# Uniform quantization across full range
w_min, w_max = weights_with_outliers.min(), weights_with_outliers.max()
scale = (w_max - w_min) / 15  # 16 levels (0-15) for unsigned INT4


def quantize_uniform(weights, scale, w_min):
    """Uniform quantization to 4-bit range (0-15)"""
    q = np.round((weights - w_min) / scale).astype(int)
    q = np.clip(q, 0, 15)
    return q


def dequantize_uniform(q, scale, w_min):
    """Dequantize back to float"""
    return q * scale + w_min


q_weights = quantize_uniform(weights_with_outliers, scale, w_min)
reconstructed = dequantize_uniform(q_weights, scale, w_min)

# Calculate error metrics
mse = np.mean((weights_with_outliers - reconstructed) ** 2)
max_error = np.max(np.abs(weights_with_outliers - reconstructed))
Out[6]:
Console
Weight range: [-0.5000, 0.6000]
Scale factor: 0.073333
Mean squared error: 0.00039789
Maximum error: 0.036626

Number of unique quantized values used: 7

With a scale of approximately 0.07, each quantization step is large relative to our typical weights near zero. Most small weights get mapped to the same few quantization levels, destroying the subtle differences between them. The quantization process discards the details that distinguish small weights, even though these differences affect inference.

Out[7]:
Visualization
INT4 quantization boundaries with outliers. The wide range required to accommodate outliers (orange dashed lines) leaves few levels for the dense region near zero.
Distribution of absolute quantization errors. The mean error is high because the large quantization step size fails to resolve small weight differences.

Group-Wise Quantization

The solution to the outlier problem at 4-bit precision is group-wise quantization, also called block-wise quantization. This technique represents a fundamental shift in how we approach the quantization problem. Instead of using a single scale factor for an entire tensor or even a channel, we divide weights into small groups and compute separate scale factors for each group. This localized approach isolates the impact of outliers, preventing them from corrupting the quantization of the entire weight matrix.

The Group Quantization Concept

To understand why group-wise quantization works so effectively, consider the spatial distribution of outliers within a weight tensor. In trained neural networks, outliers typically don't occur uniformly throughout a tensor. Instead, they appear sporadically, concentrated in certain locations while leaving vast regions of the tensor relatively outlier-free. By using small groups, an outlier affects only the quantization of its own group, not the entire layer. The remaining groups, free from outlier contamination, can use their 16 quantization levels efficiently to represent their local weight distribution with high fidelity.

For a weight tensor, group-wise quantization proceeds through the following steps:

  1. Divide the weights into groups of size g, where common choices are 32, 64, or 128 elements per group
  2. Compute a separate scale factor (and optionally zero-point) for each group based only on the weights within that group
  3. Quantize each group independently using its own scale factor

The key insight is statistical: within any sufficiently small group of weights, the probability of encountering an extreme outlier is much lower than across the entire tensor. When an outlier does appear in a group, it affects only the g weights in that group, leaving the thousands or millions of other weights in the tensor unaffected. Containing outlier damage makes 4-bit quantization practical.

Group Size Trade-off

Smaller groups provide better quantization accuracy because outliers affect fewer weights. However, smaller groups require storing more scale factors, increasing memory overhead. A group size of 128 is common, adding roughly 0.125 bits per weight in overhead (one FP16 scale per 128 4-bit weights). Choosing a group size therefore means balancing accuracy against storage, as the short calculation below illustrates; the best choice depends on the model and the available memory budget.
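
To put numbers on that overhead, here is a quick sketch (an illustrative calculation, assuming one FP16 scale per group) of the effective bits per weight for a few group sizes:

# Effective bits per weight for group-wise INT4 with one FP16 scale per group
for group_size in [32, 64, 128, 256]:
    effective_bpw = 4 + 16 / group_size
    print(f"group size {group_size:>3}: {effective_bpw:.3f} effective bits per weight")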

In[8]:
Code
def group_quantize_int4(weights, group_size=128):
    """
    Group-wise INT4 quantization.
    Returns quantized values and scale factors per group.
    """
    # Flatten weights for grouping
    flat_weights = weights.flatten()
    n = len(flat_weights)

    # Pad to multiple of group_size
    pad_len = (group_size - n % group_size) % group_size
    if pad_len > 0:
        flat_weights = np.concatenate([flat_weights, np.zeros(pad_len)])

    # Reshape into groups
    n_groups = len(flat_weights) // group_size
    groups = flat_weights.reshape(n_groups, group_size)

    # Compute scale per group (symmetric quantization)
    max_vals = np.max(np.abs(groups), axis=1, keepdims=True)
    scales = max_vals / 7.0  # Symmetric INT4: -7 to +7
    scales = np.where(scales == 0, 1.0, scales)  # Avoid division by zero

    # Quantize each group
    q_groups = np.round(groups / scales).astype(np.int8)
    q_groups = np.clip(q_groups, -8, 7)

    return q_groups.flatten()[:n], scales.flatten()


def group_dequantize_int4(q_weights, scales, group_size=128):
    """Dequantize group-wise quantized weights."""
    n = len(q_weights)
    pad_len = (group_size - n % group_size) % group_size

    if pad_len > 0:
        q_weights = np.concatenate([q_weights, np.zeros(pad_len)])

    n_groups = len(q_weights) // group_size
    q_groups = q_weights.reshape(n_groups, group_size)
    scales = scales.reshape(-1, 1)

    deq_groups = q_groups * scales
    return deq_groups.flatten()[:n]

Now let's compare uniform quantization versus group-wise quantization on our weights with outliers. This comparison will demonstrate the dramatic improvement that group-wise quantization provides when outliers are present in the data.

In[9]:
Code
# Compare quantization approaches
group_size = 32  # Small groups for this example

# Group-wise quantization
q_grouped, scales = group_quantize_int4(weights_with_outliers, group_size)
reconstructed_grouped = group_dequantize_int4(q_grouped, scales, group_size)

# Calculate errors for comparison
mse_uniform = np.mean((weights_with_outliers - reconstructed) ** 2)
mse_grouped = np.mean((weights_with_outliers - reconstructed_grouped) ** 2)
Out[10]:
Console
Quantization Error Comparison:
  Uniform INT4 MSE:     0.00039789
  Group-wise INT4 MSE:  0.00000701
  Error reduction:      98.2%

Number of scale factors stored: 32
Out[11]:
Visualization
Error distribution comparison. Group-wise quantization (blue) concentrates errors near zero, while uniform quantization (orange) yields a broader error distribution due to outlier interference.
Out[12]:
Visualization
Reconstruction accuracy scatter plot. Group-wise reconstructed weights (blue) track the perfect reconstruction line closely, whereas uniform quantization (orange) shows significant deviation.

Group-wise quantization dramatically reduces quantization error because outliers no longer dominate the scale factor for the majority of weights. Each group operates with a scale factor tailored to its local distribution, ensuring that the 16 available quantization levels are deployed where they provide the most benefit for that particular subset of weights.

Optimal Group Size Selection

The choice of group size balances accuracy against memory overhead, and understanding this trade-off is essential for making informed deployment decisions. Smaller groups provide finer-grained adaptation to local weight statistics, but each group requires its own scale factor. These scale factors, typically stored in FP16 format (16 bits each), add to the overall memory footprint. Let's examine this trade-off quantitatively:

In[13]:
Code
# Test different group sizes
np.random.seed(42)
test_weights = np.random.randn(4096) * 0.02
# Add some outliers
test_weights[100] = 0.5
test_weights[200] = -0.4
test_weights[1500] = 0.45

group_sizes = [16, 32, 64, 128, 256, 512]
results = []

for gs in group_sizes:
    q, s = group_quantize_int4(test_weights, gs)
    recon = group_dequantize_int4(q, s, gs)
    mse = np.mean((test_weights - recon) ** 2)

    # Calculate effective bits per weight
    n_scales = len(s)
    scale_bits = n_scales * 16  # FP16 scales
    weight_bits = len(test_weights) * 4  # INT4 weights
    total_bits = scale_bits + weight_bits
    effective_bpw = total_bits / len(test_weights)

    results.append(
        {
            "group_size": gs,
            "mse": mse,
            "effective_bpw": effective_bpw,
            "n_scales": n_scales,
        }
    )
Out[14]:
Console
Group Size vs. Accuracy Trade-off:
-------------------------------------------------------
  Group Size            MSE  Bits/Weight     Scales
-------------------------------------------------------
          16     0.00000593        5.000        256
          32     0.00000987        4.500        128
          64     0.00001729        4.250         64
         128     0.00003091        4.125         32
         256     0.00004211        4.062         16
         512     0.00008001        4.031          8
Out[15]:
Visualization
Group size trade-off in INT4 quantization. Smaller groups reduce quantization error (left axis, blue) but increase memory overhead measured as effective bits per weight (right axis, orange). Group sizes of 64 or 128 offer a good balance.

Group sizes of 32-128 typically offer the best balance between quantization fidelity and storage efficiency. Smaller groups provide diminishing returns in accuracy while significantly increasing the number of scale factors to store. The sweet spot depends on the specific model and hardware constraints, but 64 or 128 elements per group has emerged as a common choice in practice, providing good accuracy with manageable overhead.

4-Bit Number Formats

Not all 4-bit formats are the same. Your choice significantly affects model quality. The standard INT4 format, while simple and well-understood, may not be optimal for neural network weights. Several specialized 4-bit formats have been developed to better match the statistical properties of weight distributions, each with its own trade-offs between representational efficiency and computational convenience.

Standard INT4

Standard INT4 uses 4 bits to represent integers in a fixed range. The simplicity of this format makes it computationally efficient and easy to implement:

  • Unsigned INT4: Values 0 to 15, representing 16 non-negative integers
  • Signed INT4: Values -8 to 7 (or -7 to 7 for symmetric quantization around zero)

The quantization levels in standard INT4 are uniformly spaced across this range, meaning adjacent levels differ by exactly the same amount regardless of where they fall in the range. For neural network weights that follow a roughly Gaussian distribution, this uniform spacing wastes representational capacity. The problem is one of mismatch between the format and the data. Many quantization levels fall in the tails of the distribution where few weights exist, while the dense center region near zero has too few levels to capture the subtle variations among the many weights clustered there.
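
To see how much of a uniform grid goes unused on Gaussian data, the short sketch below (an illustration, not part of the original notebook) quantizes standard normal samples onto 16 evenly spaced levels spanning roughly ±3 standard deviations and counts how often each level is hit; the clipping range and sample count are assumptions made for this example:

import numpy as np

np.random.seed(42)
samples = np.random.randn(100_000)

# 16 evenly spaced levels across an assumed clipping range of +/- 3 std devs
levels = np.linspace(-3, 3, 16)

# Assign each sample to its nearest level and measure how often each is used
nearest = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
usage = np.bincount(nearest, minlength=16) / len(samples)

for i, frac in enumerate(usage):
    print(f"level {i:2d} ({levels[i]:+.2f}): {frac:6.2%}")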

NF4 (Normal Float 4)

NF4, introduced alongside QLoRA (which we covered in Part XXV), represents a fundamentally different approach to 4-bit representation. Rather than accepting uniform spacing as a given, NF4 is specifically designed for normally distributed data. Instead of uniform spacing, NF4 places quantization levels such that each level represents an equal probability mass under a standard normal distribution.

NF4 Design Principle

NF4 chooses its 16 quantization levels so that when weights are normally distributed, each level is equally likely to be used. This maximizes the information entropy of the quantized representation: by ensuring each quantization level is equally probable, we extract maximum information from our limited 4-bit budget. Note that equal-probability levels are not the same as the levels that would minimize mean squared error; we return to this distinction when comparing formats later in the chapter.

The NF4 quantization levels are computed through a principled mathematical procedure. The core idea is to find the quantile boundaries of the standard normal distribution that divide it into 16 regions of equal probability, then use the midpoint of each region as the quantization level. This ensures that, for normally distributed data, approximately 6.25% of weights (one-sixteenth) will map to each quantization level, making full use of all available levels.

In[16]:
Code
from scipy import stats


def compute_nf4_levels():
    """
    Compute NF4 quantization levels based on normal distribution quantiles.
    """
    # We want 16 levels that divide the normal distribution into
    # 16 equal probability regions
    n_levels = 16

    # Compute quantile boundaries
    quantile_probs = np.linspace(0, 1, n_levels + 1)
    quantile_values = stats.norm.ppf(quantile_probs)

    # Handle infinities at boundaries
    quantile_values[0] = -3.5  # Approximate -inf
    quantile_values[-1] = 3.5  # Approximate +inf

    # Compute midpoints as quantization levels
    levels = (quantile_values[:-1] + quantile_values[1:]) / 2

    # Ensure zero is included (important for neural networks)
    # The standard approach includes 0 by making levels symmetric
    # and including 0 explicitly
    return levels


nf4_levels = compute_nf4_levels()
Out[17]:
Console
NF4 Quantization Levels (16 values):
----------------------------------------
  Level  0: -2.5171
  Level  1: -1.3422
  Level  2: -1.0187
  Level  3: -0.7808
  Level  4: -0.5816
  Level  5: -0.4037
  Level  6: -0.2380
  Level  7: -0.0787
  Level  8: +0.0787
  Level  9: +0.2380
  Level 10: +0.4037
  Level 11: +0.5816
  Level 12: +0.7808
  Level 13: +1.0187
  Level 14: +1.3422
  Level 15: +2.5171

Notice how the NF4 levels are denser near zero, where most weights concentrate, and sparser in the tails where weights are rare. This non-uniform spacing is the key innovation: by allocating more quantization levels to the regions where data is abundant, NF4 reduces average quantization error compared to uniform spacing. The levels near zero differ by small amounts, allowing fine discrimination among the many similar small weights. The levels in the tails differ by larger amounts, but this matters less because few weights fall there. Let's visualize this relationship between the quantization levels and the underlying distribution:

In[18]:
Code
## Code demonstration only - plotting moved to visualization block
pass
Out[19]:
Visualization
NF4 quantization levels against a normal distribution. The red vertical lines show how NF4 concentrates quantization levels in the high-density region near zero.
Spacing comparison between INT4 and NF4 levels. INT4 (blue) uses uniform spacing, while NF4 (red) uses variable spacing that is denser in the center and sparser in the tails.

The visualization confirms that NF4 concentrates resolution where the data is, so that for Gaussian-distributed weights each of the 16 levels does useful work rather than sitting idle in a sparsely populated tail. This distribution-aware approach to level placement is what makes NF4 well-suited to neural network weight quantization, where approximate normality is a reasonable assumption for most layers.

FP4 (4-Bit Floating Point)

Another approach is to use a floating-point representation with 4 bits. This format attempts to bring the dynamic range advantages of floating-point to the extremely constrained 4-bit budget. FP4 typically uses:

  • 1 sign bit, determining whether the value is positive or negative
  • 2 exponent bits, controlling the magnitude or scale of the value
  • 1 mantissa bit, providing a single binary digit of precision within each exponent range

Like larger floating-point formats, FP4 uses its exponent to span a wide dynamic range of magnitudes. However, the single mantissa bit means each exponent range contains only 2 representable values (a significand of 1.0 or 1.5, scaled by the exponent). This extreme coarseness limits the practical utility of FP4 for many applications.

In[20]:
Code
def generate_fp4_values():
    """
    Generate all possible FP4 values.
    Format: 1 sign bit, 2 exponent bits, 1 mantissa bit
    Uses E2M1 format with bias of 1
    """
    values = []

    # Exponent bias
    bias = 1

    for sign in [0, 1]:
        for exp in range(4):  # 2 bits = 0-3
            for mantissa in range(2):  # 1 bit = 0-1
                if exp == 0:
                    # Subnormal numbers
                    value = (mantissa / 2) * (2 ** (1 - bias))
                else:
                    # Normal numbers
                    value = (1 + mantissa / 2) * (2 ** (exp - bias))

                if sign == 1:
                    value = -value

                values.append(value)

    return sorted(set(values))


fp4_values = generate_fp4_values()
Out[21]:
Console
FP4 (E2M1) Quantization Values:
----------------------------------------
  Value  0: -6.0000
  Value  1: -4.0000
  Value  2: -3.0000
  Value  3: -2.0000
  Value  4: -1.5000
  Value  5: -1.0000
  Value  6: -0.5000
  Value  7: +0.0000
  Value  8: +0.5000
  Value  9: +1.0000
  Value 10: +1.5000
  Value 11: +2.0000
  Value 12: +3.0000
  Value 13: +4.0000
  Value 14: +6.0000

FP4's values are concentrated near zero but with exponentially increasing gaps as values get larger, following the characteristic pattern of floating-point representations. This structure can be useful for weight distributions with heavy tails, where the ability to represent large outliers without sacrificing too much precision near zero is valuable. However, the extreme coarseness introduced by having only a single mantissa bit limits practical utility for most neural network applications, where the fine distinctions among small weights are often critically important.
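
A quick look at the gaps between adjacent FP4 values (reusing the fp4_values list computed above) makes the exponentially growing spacing concrete:

# Gaps between adjacent FP4 values: narrow near zero, doubling toward the tails
gaps = np.diff(fp4_values)
for lo, hi, gap in zip(fp4_values[:-1], fp4_values[1:], gaps):
    print(f"{lo:+.1f} -> {hi:+.1f}: gap = {gap:.2f}")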

Comparing 4-Bit Formats

Let's quantize the same weights using different 4-bit formats and compare the reconstruction error. This empirical comparison will reveal how format choice affects quantization quality for normally distributed data:

In[22]:
Code
def quantize_to_nearest_level(weights, levels):
    """Quantize each weight to the nearest available level."""
    weights_flat = weights.flatten()
    levels = np.array(levels)

    # Find nearest level for each weight
    distances = np.abs(weights_flat[:, np.newaxis] - levels[np.newaxis, :])
    nearest_idx = np.argmin(distances, axis=1)
    quantized = levels[nearest_idx]

    return quantized.reshape(weights.shape)


# Generate test weights (normalized to match NF4 expected distribution)
np.random.seed(42)
test_weights = np.random.randn(10000)

# Quantize with different formats
int4_levels_symmetric = np.linspace(-3, 3, 16)  # Symmetric INT4

q_int4 = quantize_to_nearest_level(test_weights, int4_levels_symmetric)
q_nf4 = quantize_to_nearest_level(test_weights, nf4_levels)
q_fp4 = quantize_to_nearest_level(test_weights, fp4_values)

# Calculate metrics
mse_int4 = np.mean((test_weights - q_int4) ** 2)
mse_nf4 = np.mean((test_weights - q_nf4) ** 2)
mse_fp4 = np.mean((test_weights - q_fp4) ** 2)
Out[23]:
Console
Reconstruction Error for Normally Distributed Weights:
--------------------------------------------------
  INT4 (uniform):    MSE = 0.013780
  NF4 (normal-opt):  MSE = 0.023449
  FP4 (E2M1):        MSE = 0.022841
--------------------------------------------------
NF4 vs INT4 improvement: -70.2%
Out[24]:
Visualization
Mean squared error by quantization format on a standard normal sample. In this raw MSE comparison, the uniform INT4 grid spanning [-3, 3] yields the lowest error; NF4 trades some MSE in the tails for equal utilization of all 16 levels.
Out[25]:
Visualization
Quantization level utilization percentages. NF4 (green) utilizes all bins equally at 6.25%, whereas uniform INT4 (blue) overutilizes central bins and leaves tail bins empty.

These numbers deserve a careful reading. On this clean standard normal sample, the uniform grid spanning [-3, 3] actually achieves the lowest raw MSE: equal-probability levels maximize the information carried per level, but they are not the levels that minimize mean squared error for a Gaussian, and NF4's outermost levels leave wide gaps in the tails. NF4's strength lies elsewhere: every level absorbs the same share of the weights, so none of the 16 codes sits idle, as the utilization plot above shows. Combined with per-group absmax scaling on real weight tensors, this balanced utilization is what pays off in practice; in the QLoRA evaluations, NF4 delivered better downstream model quality than both uniform INT4 and FP4.

Double Quantization

A technique called double quantization, also introduced with QLoRA, further reduces memory overhead by quantizing the scale factors themselves. This recursive application of quantization addresses a subtle but important issue with group-wise quantization. In standard group-wise quantization, we store one FP16 scale factor per group. With groups of 128, this adds 0.125 bits per weight (16 bits divided by 128 weights per group). While this overhead may seem modest, it becomes significant when the goal is to minimize memory footprint as aggressively as possible.

Double quantization applies a second round of quantization to these scale factors, treating them as a new quantization problem unto themselves:

  1. Collect all scale factors from the first quantization into a vector
  2. Group these scale factors together, typically in groups of 256
  3. Quantize the scale factors to FP8 or INT8 precision, which is coarser than FP16 but still accurate enough for scale factors
  4. Store a single FP32 scale factor per group of scale factors to enable reconstruction

The key insight enabling double quantization is that scale factors themselves exhibit predictable statistical properties. Within a neural network, scale factors for different groups tend to fall within a relatively narrow range, making them amenable to quantization without significant loss of information. The second-level scale factors (the "meta-scales") require only FP32 precision and are few in number, adding negligible overhead.

In[26]:
Code
def double_quantize(weights, group_size=64, scale_group_size=256):
    """
    Double quantization: quantize weights, then quantize the scales.
    """
    # First quantization: weights to INT4
    flat_weights = weights.flatten()
    n = len(flat_weights)

    # Pad for grouping
    pad_len = (group_size - n % group_size) % group_size
    if pad_len > 0:
        flat_weights = np.concatenate([flat_weights, np.zeros(pad_len)])

    n_groups = len(flat_weights) // group_size
    groups = flat_weights.reshape(n_groups, group_size)

    # Compute scales (first level)
    scales_fp32 = np.max(np.abs(groups), axis=1) / 7.0
    scales_fp32 = np.where(scales_fp32 == 0, 1.0, scales_fp32)

    # Quantize weights using scales
    q_weights = np.round(groups / scales_fp32[:, np.newaxis]).astype(np.int8)
    q_weights = np.clip(q_weights, -8, 7)

    # Second quantization: scales to INT8
    # Pad scales for grouping
    n_scales = len(scales_fp32)
    scale_pad = (
        scale_group_size - n_scales % scale_group_size
    ) % scale_group_size
    if scale_pad > 0:
        scales_padded = np.concatenate([scales_fp32, np.zeros(scale_pad)])
    else:
        scales_padded = scales_fp32

    n_scale_groups = len(scales_padded) // scale_group_size
    scale_groups = scales_padded.reshape(n_scale_groups, scale_group_size)

    # Compute meta-scales (second level)
    meta_scales = np.max(np.abs(scale_groups), axis=1) / 127.0
    meta_scales = np.where(meta_scales == 0, 1.0, meta_scales)

    # Quantize scales to INT8
    q_scales = np.round(scale_groups / meta_scales[:, np.newaxis]).astype(
        np.int8
    )
    q_scales = np.clip(q_scales, -128, 127)

    return {
        "q_weights": q_weights.flatten()[:n],
        "q_scales": q_scales.flatten()[:n_scales],
        "meta_scales": meta_scales,
        "group_size": group_size,
        "scale_group_size": scale_group_size,
        "original_len": n,
    }
In[27]:
Code
# Calculate memory savings
np.random.seed(42)
large_weights = np.random.randn(1_000_000) * 0.02

result = double_quantize(large_weights, group_size=64, scale_group_size=256)

n_weights = len(large_weights)
n_scales = len(result["q_scales"])
n_meta_scales = len(result["meta_scales"])

# Calculate memory usage in bits
original_bits = n_weights * 32  # FP32
single_quant_bits = n_weights * 4 + n_scales * 16  # INT4 weights + FP16 scales
double_quant_bits = n_weights * 4 + n_scales * 8 + n_meta_scales * 32
Out[28]:
Console
Memory Analysis for 1M Weights:
--------------------------------------------------
Original FP32:           4.00 MB
Single quantization:     0.53 MB
Double quantization:     0.52 MB
--------------------------------------------------
Effective bits/weight (single): 4.250
Effective bits/weight (double): 4.127
Out[29]:
Visualization
Total memory usage for 1 million weights. Double quantization reduces memory footprint compared to single quantization, approaching the theoretical minimum of 4 bits per weight.
Effective bits per weight breakdown. Double quantization shrinks the scale factor overhead (green) compared to single quantization (orange), adding only negligible meta-scale overhead (yellow).

Double quantization reduces the overhead from scale factors, getting closer to true 4 bits per weight while maintaining the benefits of group-wise quantization. The additional complexity in the dequantization path (needing to reconstruct scales before reconstructing weights) is modest and well worth the memory savings for memory-constrained deployments.
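
The chapter defines double_quantize but not its inverse, so here is a minimal dequantization sketch that consumes the dictionary returned above: it first rebuilds the per-group scales from the INT8 scales and FP32 meta-scales, then rebuilds the weights from the INT4 values. The field names follow the double_quantize implementation; treat this as an illustrative sketch rather than a production kernel.

import numpy as np


def double_dequantize(packed):
    """Invert double_quantize: rebuild the scales first, then the weights."""
    group_size = packed["group_size"]
    scale_group_size = packed["scale_group_size"]
    n = packed["original_len"]

    # Step 1: reconstruct per-group scales from INT8 scales and FP32 meta-scales
    q_scales = packed["q_scales"].astype(np.float64)
    n_scales = len(q_scales)
    pad = (scale_group_size - n_scales % scale_group_size) % scale_group_size
    if pad > 0:
        q_scales = np.concatenate([q_scales, np.zeros(pad)])
    scale_groups = q_scales.reshape(-1, scale_group_size)
    scales = (scale_groups * packed["meta_scales"][:, np.newaxis]).flatten()
    scales = scales[:n_scales]

    # Step 2: reconstruct weights from INT4 values and the recovered scales
    q_weights = packed["q_weights"].astype(np.float64)
    pad_w = (group_size - n % group_size) % group_size
    if pad_w > 0:
        q_weights = np.concatenate([q_weights, np.zeros(pad_w)])
    groups = q_weights.reshape(-1, group_size)
    return (groups * scales[:, np.newaxis]).flatten()[:n]


# Round-trip check on the 1M-weight example quantized above
recovered = double_dequantize(result)
print("Round-trip MSE:", np.mean((large_weights - recovered) ** 2))

The round-trip error should be close to the single-quantization error, with only a small additional contribution from the quantized scales.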

Accuracy Trade-offs

How much does INT4 quantization affect model quality? The answer depends heavily on the model, task, and quantization technique used. Understanding these dependencies is crucial for making informed decisions about when and how to deploy 4-bit quantized models. Let's explore the key factors that determine INT4 success.

Model Size Matters

Larger models tolerate quantization better than smaller ones, a phenomenon that has been consistently observed across many model families and quantization methods. A 70B parameter model quantized to INT4 often performs comparably to its FP16 version, while a 7B model may show noticeable degradation. Larger models are more redundant and tolerate information loss from quantization better.

Larger models are more robust because they have more parameters to encode information. When quantization introduces errors into some of these parameters, the remaining parameters can compensate, maintaining the overall behavior of the network. Smaller models lack this redundancy: each parameter carries more information, and corrupting that information through quantization has proportionally larger effects.

In[30]:
Code
# Simulate quantization impact vs model size
# Based on empirical observations from the literature

model_sizes = [1, 3, 7, 13, 30, 70]  # Billions of parameters
# Baseline FP16 perplexities (illustrative values for this simulation)
fp16_perplexity = [15.0, 10.0, 7.5, 6.5, 5.5, 5.0]

# INT4 degradation diminishes with scale
int4_degradation_pct = [25, 15, 8, 5, 3, 2]  # Percentage increase in perplexity
int4_perplexity = [
    p * (1 + d / 100) for p, d in zip(fp16_perplexity, int4_degradation_pct)
]
Out[31]:
Visualization
Impact of INT4 quantization across model sizes. Larger models (right) show significantly less perplexity degradation compared to smaller models (left), demonstrating their greater robustness to precision loss.

Task Sensitivity

Different tasks have different sensitivity to quantization errors, and understanding this variation is important for deployment decisions. Tasks requiring precise numerical reasoning or factual recall tend to suffer more from quantization than tasks involving broader pattern recognition.

This sensitivity stems from how errors propagate through the model during different types of computation. For mathematical reasoning, a small error in intermediate calculations can compound into a wrong final answer. The model must maintain precise numerical relationships across many computation steps, and quantization errors can accumulate or interfere with these delicate calculations. For summarization, slightly imprecise representations still capture the overall meaning, and the model's task is more about recognizing patterns and relationships than performing exact computation.

Layer-Specific Quantization

Not all layers in a transformer contribute equally to model quality, and this observation has important implications for quantization strategy. Research has shown that certain layers are more sensitive to quantization errors than others:

  • Attention projections (Q, K, V, and output projections): Often sensitive, especially in earlier layers, because they determine how information flows between positions in the sequence
  • Feed-forward networks: Generally more robust to quantization, perhaps because they perform more local computations that are less affected by small perturbations
  • Embedding layers: Very sensitive, often kept at higher precision, because they form the foundation upon which all subsequent computation builds

This observation leads to mixed-precision strategies where sensitive layers use INT8 or FP16 while robust layers use INT4. Such strategies capture most of the memory savings of uniform INT4 quantization while preserving higher precision for the operations that need it:

In[32]:
Code
# Example of a layer sensitivity analysis
layer_types = [
    "Embedding",
    "Q Projection",
    "K Projection",
    "V Projection",
    "Attention Out",
    "FFN Up",
    "FFN Down",
    "Layer Norm",
    "LM Head",
]

# Relative sensitivity (1 = baseline, higher = more sensitive)
sensitivity_scores = [2.5, 1.8, 1.6, 1.5, 1.4, 1.0, 1.0, 2.0, 2.2]

# Parameter count percentage (rough estimates)
param_percentages = [5, 8, 8, 8, 8, 25, 25, 0.1, 5]

# Generate recommendations
recommendations = []
for sens in sensitivity_scores:
    if sens >= 2.0:
        recommendations.append("FP16 or INT8")
    elif sens >= 1.4:
        recommendations.append("INT8")
    else:
        recommendations.append("INT4")
Out[33]:
Console
Layer Quantization Sensitivity Analysis:
-----------------------------------------------------------------
Layer Type          Sensitivity   % Params        Recommended
-----------------------------------------------------------------
Embedding                   2.5        5.0%       FP16 or INT8
Q Projection                1.8        8.0%               INT8
K Projection                1.6        8.0%               INT8
V Projection                1.5        8.0%               INT8
Attention Out               1.4        8.0%               INT8
FFN Up                      1.0       25.0%               INT4
FFN Down                    1.0       25.0%               INT4
Layer Norm                  2.0        0.1%       FP16 or INT8
LM Head                     2.2        5.0%       FP16 or INT8
Out[34]:
Visualization
Layer sensitivity to quantization with parameter percentages. Embedding and output layers show high sensitivity (red), requiring higher precision, while Feed-Forward Networks (green) are robust to INT4 quantization despite containing the majority of parameters.

The results suggest a hybrid approach: retain high precision for embeddings and attention outputs, but aggressively quantize the feed-forward networks that contain the majority of parameters. Since FFN layers often account for roughly half of a transformer's parameters, quantizing them to INT4 while keeping other layers at INT8 can still provide substantial memory savings with minimal accuracy loss.
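
As a rough back-of-the-envelope check, the sketch below estimates the average bits per weight of such a hybrid scheme, reusing the illustrative param_percentages and recommendations from the sensitivity analysis above (rough estimates, not measured values):

# Average bits per weight for the hybrid scheme suggested by the table above.
# Note: the illustrative percentages are rough and do not sum exactly to 100.
precision_bits = {"FP16 or INT8": 16, "INT8": 8, "INT4": 4}  # conservative: FP16 for sensitive layers

avg_bits = sum(
    pct / 100 * precision_bits[rec]
    for pct, rec in zip(param_percentages, recommendations)
)

print(f"Hybrid scheme:  ~{avg_bits:.1f} average bits per weight")
print("Uniform INT4:    4.0 bits per weight")
print("Uniform FP16:   16.0 bits per weight")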

Practical Implementation with bitsandbytes

The bitsandbytes library provides efficient 4-bit quantization for PyTorch models. It implements NF4 quantization with group-wise scaling and double quantization, making it easy to load large models in 4-bit precision. The library handles format conversion, scale management, and dequantization during inference, providing a simple API.

In[35]:
Code
# Note: This code demonstrates the API but is not executed
# as bitsandbytes requires GPU and specific CUDA setup

# from transformers import BitsAndBytesConfig
#
# # Configure 4-bit quantization
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",           # Use NF4 format
#     bnb_4bit_compute_dtype=torch.float16, # Compute in FP16
#     bnb_4bit_use_double_quant=True,       # Enable double quantization
# )

print("BitsAndBytesConfig parameters for INT4 quantization:")
print("  load_in_4bit=True")
print("  bnb_4bit_quant_type='nf4'")
print("  bnb_4bit_compute_dtype=torch.float16")
print("  bnb_4bit_use_double_quant=True")

# Load model with 4-bit quantization
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-hf",
#     quantization_config=quantization_config,
#     device_map="auto"
# )

# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

Let's examine the memory savings:

In[36]:
Code
# Calculate memory savings for a 7B parameter model
n_params = 7_000_000_000

# FP16 (2 bytes per param)
fp16_bytes = n_params * 2

# INT4 + Double Quantization
# 4 bits per weight + quantization overhead (approx 0.125 bits + double quant)
# Roughly 4.2 bits per parameter total
int4_bits_per_param = 4.2
int4_bytes = n_params * (int4_bits_per_param / 8)
Out[37]:
Console
Model size: 7B parameters
------------------------------
FP16 Memory: 14.00 GB
INT4 Memory: 3.67 GB (approximate)

This dramatic reduction lets the weights of a 7B model fit comfortably within the 6GB or 8GB of VRAM found on many consumer GPUs, leaving headroom for activations and the KV cache. What was previously possible only on expensive data center GPUs becomes achievable on hardware many readers already own.

Limitations and Practical Considerations

INT4 quantization enables running large language models on consumer hardware, but it comes with important trade-offs that you must understand.

The most significant limitation is accuracy degradation on complex tasks. While INT4 models perform well on conversational AI and text generation, they often struggle with tasks requiring precise reasoning. Mathematical problem solving, multi-step logical inference, and tasks requiring exact factual recall show measurable degradation. For applications where accuracy is critical, INT8 or even FP16 may be necessary despite the higher memory cost.

Another practical concern is computational overhead during inference. Although INT4 weights consume less memory, the dequantization step (converting INT4 back to FP16 for matrix multiplication) adds latency. Most GPUs lack native matrix-multiplication support for these 4-bit formats, so each forward pass must dequantize weights on the fly. This overhead is typically small relative to the overall forward pass, but it is not zero. Libraries like bitsandbytes, and the GPTQ and AWQ methods covered in upcoming chapters, implement various optimizations to minimize it.

Calibration requirements also vary between quantization methods. The simple symmetric quantization we implemented here doesn't require calibration data, but more sophisticated methods like GPTQ (covered in the next chapter) use calibration data to find optimal quantization parameters. The quality of calibration data affects the final model quality.

Finally, INT4 quantization is primarily a memory optimization, not a speed optimization. While reducing memory enables running larger models or larger batch sizes, the actual compute doesn't speed up proportionally because dequantization is required. For pure throughput optimization, other techniques like speculative decoding (covered later in this part) may be more effective.

Summary

INT4 quantization pushes the boundaries of model compression, reducing memory requirements to roughly 4.1 to 4.5 bits per weight including scale factors, depending on group size and whether double quantization is used. The key insights from this chapter are:

  • 16 levels are not enough for naive uniform quantization. The limited representational capacity requires sophisticated techniques to maintain model quality.

  • Group-wise quantization is essential for INT4 success. By using separate scale factors for groups of 32-128 weights, outliers affect only their local group rather than the entire tensor.

  • NF4 format matches its quantization levels to the shape of normally distributed weights, so all 16 levels are used equally; combined with group-wise scaling, this yields better 4-bit model quality in practice than uniform INT4. This format, introduced with QLoRA, has become the de facto standard for 4-bit inference.

  • Double quantization reduces the memory overhead from scale factors by quantizing the scales themselves, getting closer to true 4 bits per weight.

  • Model size and task complexity determine quantization tolerance. Larger models (30B+) typically maintain quality at INT4, while smaller models may need INT8 or mixed precision strategies.

The next chapter covers GPTQ, a calibration-based quantization method that uses second-order information to find optimal weight quantization, often achieving better accuracy than the simpler methods discussed here.

