IVF Index: Clustering-Based Vector Search & Partitioning

Michael Brenndoerfer

Machine Learning Data, Analytics & AI Language AI Handbook

Master IVF indexes for scalable vector search. Learn clustering-based partitioning, nprobe tuning, and IVF-PQ compression for billion-scale retrieval.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

IVF IndexLink Copied

In the previous chapter, we explored HNSW, a graph-based approach to approximate nearest neighbor search that builds navigable small-world graphs for fast traversal. HNSW is powerful but comes with a significant memory cost. The graph structure itself must be stored alongside the vectors, and the entire index typically needs to reside in memory. For collections of tens or hundreds of millions of vectors, this memory requirement can become prohibitive.

The Inverted File Index (IVF) takes a fundamentally different approach. Instead of building a graph, it partitions the vector space into regions using clustering, then organizes vectors by which region they belong to. At query time, rather than searching all vectors, you search only the vectors in the few regions closest to your query. This is analogous to a library organizing books by subject. If you want a book about marine biology, you don't scan every shelf; you go directly to the science section.

The core insight behind IVF is that vectors that are close in the embedding space tend to cluster together. By grouping similar vectors and only searching relevant groups, IVF dramatically reduces the number of distance computations needed. The approach is conceptually simple, scales gracefully to very large datasets, and can be combined with compression techniques like Product Quantization, which we'll cover in the next chapter, to handle billion-scale collections.

The Clustering ApproachLink Copied

IVF works in two phases: an offline indexing phase that partitions the vector space, and an online query phase that searches only the relevant partitions. The advantage of this two-phase design is that the expensive work of organizing the data happens once during indexing so that every subsequent query can exploit that precomputed structure to avoid scanning the full dataset. Let us examine each phase in detail, starting with how the index is built.

Index ConstructionLink Copied

The indexing phase uses k-means clustering to divide the vector space into $k$ distinct regions, called Voronoi cells. Each cell is defined by its centroid, a representative point at the center of the cluster. The process proceeds through three steps: first we discover where the natural groupings in the data lie, then we assign every vector to its closest grouping, and finally we build the data structure that makes retrieval efficient. Here is how the process works:

Train centroids: We begin by discovering natural groupings in the data. The fundamental question at this stage is: if we had to summarize the entire dataset using only $k$ representative points, where should those points be placed so that every vector in the collection is as close as possible to at least one of them? K-means clustering answers precisely this question. It identifies $k$ centroids $\{c_1, c_2, \ldots, c_k\}$ that serve as representative points for different regions of the vector space.

K-means clustering finds $k$ representative points (centroids) by grouping vectors so that vectors within each group are as close as possible to their group's center. This minimizes the total "within-cluster" distance across all groups. The algorithm accomplishes this iteratively: it starts with an initial guess for where the centroids should be, then repeatedly refines those positions by reassigning vectors and recalculating centers until the arrangement stabilizes.

Formally, k-means seeks to minimize the objective function:

\sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - c_i\|^2

where:

$k$ : the number of clusters (centroids) we want to create, controlling how finely we partition the space
$c_i$ : the $i$ -th centroid, a $d$ -dimensional vector representing the center of cluster $i$
$C_i$ : the set of all vectors assigned to cluster $i$
$\mathbf{x}$ : a vector in the dataset
$\|\mathbf{x} - c_i\|^2$ : the squared Euclidean distance between vector $\mathbf{x}$ and centroid $c_i$ , measuring how far the vector is from its assigned centroid
$d$ : the dimensionality of each vector in the dataset

This objective function captures the total within-cluster variance, which serves as a single number summarizing how well the centroids represent their assigned vectors. A lower value means tighter, more compact clusters where vectors are close to their centroids, exactly the property we want for IVF to work well. The inner sum $\sum_{\mathbf{x} \in C_i}$ accumulates distances for all vectors in cluster $i$ , measuring the spread of that individual cluster. The outer sum $\sum_{i=1}^{k}$ then combines these individual cluster costs across all $k$ clusters, yielding the total cost of the entire partitioning.

Why does the objective use squared distance rather than plain distance? The squared distance grows quadratically with separation, so vectors far from their centroid contribute disproportionately to the objective. This property encourages the algorithm to prioritize reducing large errors, naturally resulting in tight, compact clusters. A single outlier vector that is very far from its centroid will dominate the sum, pushing the algorithm to adjust the centroid positions to bring such outliers closer. The algorithm alternates between two steps: (1) assigning each vector to its nearest centroid, and (2) recomputing each centroid as the mean of all vectors assigned to it. This process repeats until the assignments stabilize (convergence). Each iteration is guaranteed to decrease (or at least not increase) the objective, which ensures that the algorithm converges, though it may settle on a local minimum rather than the global optimum.

Assign vectors: Once centroids are established, we move to the partitioning step, where the goal is to divide the entire dataset into non-overlapping groups. Each vector $\mathbf{x}$ is assigned to its nearest centroid, meaning the centroid to which it has the smallest Euclidean distance. This assignment is deterministic: given a fixed set of centroids, every vector maps to exactly one cluster. Formally, we assign each vector to cluster $C_i$ where:

i = \arg\min_{j=1,\ldots,k} \|\mathbf{x} - c_j\|

where:

$\mathbf{x}$ : a vector in the dataset that we're assigning to a cluster based on minimum distance to centroids
$i$ : the index of the cluster (centroid) to which $\mathbf{x}$ is assigned
$\arg\min$ : the operator that returns the index $j$ that minimizes the expression
$k$ : the total number of centroids
$c_j$ : the $j$ -th centroid
$\|\mathbf{x} - c_j\|$ : the Euclidean distance between vector $\mathbf{x}$ and centroid $c_j$

The $\arg\min$ operator finds which centroid index yields the smallest distance, making this a hard assignment: each vector belongs to exactly one cluster. This is a critical property for the IVF data structure because it means every vector appears in exactly one inverted list, avoiding duplication and keeping the index size manageable.

This assignment process creates Voronoi cells where all vectors in a cell share the same nearest centroid. Geometrically, you can think of each centroid as claiming the region of space that is closer to it than to any other centroid, and every vector falling within that region is added to that centroid's list.

Build inverted lists: The final step of index construction is to create the data structure that makes search efficient. We build an "inverted file" where each centroid points to the list of all vectors assigned to its cell. This is the structure that query processing will use: given a centroid, we can immediately look up every vector that belongs to its cell without scanning the entire dataset.

The term "inverted file" comes from the same concept behind inverted indexes in information retrieval, which we encountered in Part II when discussing TF-IDF. In a text inverted index, each term points to its list of documents; here, each centroid points to its list of vectors. The analogy is precise: just as a text search engine uses a term to quickly locate all documents containing that term, an IVF index uses a centroid to quickly locate all vectors belonging to that region of space.

Voronoi Cell

A Voronoi cell (or Voronoi region) is the set of all points in space that are closer to a particular centroid than to any other centroid. The collection of all Voronoi cells partitions the entire space with no gaps or overlaps. Every point belongs to exactly one cell.

With the index structure in place, the question naturally arises: how many centroids should we use? The number of centroids $k$ is a critical hyperparameter that balances cell granularity against centroid comparison cost. Too few centroids means each cell is large and expensive to scan; too many centroids means the coarse comparison step itself becomes a bottleneck. A common rule of thumb is to set:

k \approx \sqrt{n}

where:

$k$ : the number of centroids (clusters) in the IVF index
$n$ : the total number of vectors in the dataset
$\sqrt{n}$ : the square root of $n$ , serving as a baseline for centroid count

This square root relationship balances two competing costs: the cost of comparing the query to all $k$ centroids (grows with $k$ ), and the cost of scanning vectors within cells (decreases as $k$ increases because each cell becomes smaller). The square root provides a reasonable middle ground. To see why, consider that the total work is proportional to $k + n/k$ . Minimizing this expression with respect to $k$ (taking the derivative and setting it to zero) yields $k = \sqrt{n}$ , which is exactly the balancing point where the two costs are equal.

In practice, values between $4\sqrt{n}$ and $16\sqrt{n}$ are commonly used.

The square root relationship emerges from balancing two costs: too few centroids means large cells (expensive to scan), while too many centroids means expensive centroid comparison. For a collection of 1 million vectors, this means roughly 1,000 to 4,000 centroids. Each centroid's inverted list then contains on average $n/k$ vectors. This average list size is an important quantity because it determines how much work each probe requires during the query phase.

Query ProcessingLink Copied

With the index built and all vectors neatly organized into their respective inverted lists, we can now turn to the question of how search actually works. The key idea is to exploit the structure we built during indexing: instead of comparing the query against every vector in the dataset, we first identify which regions of space are most promising, and then restrict our search to only those regions. This transforms an exhaustive scan into a targeted one. When a query vector $\mathbf{q}$ arrives, IVF performs these steps:

Find nearest centroids: The first step is coarse filtering, where we identify which regions of space are worth examining in detail. We compute the distance from $\mathbf{q}$ to all $k$ centroids which is the $\texttt{nprobe}$ closest ones. This step is fast because $k \ll n$ ; we only need to perform $k$ distance computations rather than $n$ .

The intuition behind this step is straightforward. If a vector is close to the query, its centroid is likely also close to the query. Centroids represent the center of mass of their respective clusters, so the nearest centroid to the query is a strong indicator of where the nearest vectors reside. By comparing only to centroids first, we quickly narrow down which regions of space are worth examining in detail. Formally, we compute the distance from the query to each centroid:

d(\mathbf{q}, c_i) = \|\mathbf{q} - c_i\|

where:

$\mathbf{q}$ : the query vector for which we're searching for nearest neighbors
$c_i$ : the $i$ -th centroid
$d(\mathbf{q}, c_i)$ : the distance between the query and centroid $c_i$
$k$ : the total number of centroids (clusters) in the IVF index
$\texttt{nprobe}$ : the number of closest centroids whose cells we will search

We then select the $\texttt{nprobe}$ centroids with the smallest distances, controlling the accuracy-speed tradeoff. The selected centroids define a search neighborhood around the query. We trust that these closest regions of space contain most, if not all, of the true nearest neighbors.

Search inverted lists: Having identified the most promising cells, we now scan all vectors in the inverted lists of those $\texttt{nprobe}$ centroids. This is an exhaustive comparison within the selected cells, where every vector in the chosen lists is compared against the query. The crucial efficiency gain is that we examine only a small subset of the total dataset rather than the entire collection.
Return top results: From all scanned vectors, return the top- $k$ nearest neighbors based on their distances to the query. This is the fine-grained ranking step, where we take the candidate pool gathered from the selected cells and identify which candidates are truly closest to the query. We compute the distance from the query to each candidate vector:

d(\mathbf{q}, \mathbf{x}_j) = \|\mathbf{q} - \mathbf{x}_j\|

where:

$\mathbf{q}$ : the query vector
$\mathbf{x}_j$ : the $j$ -th candidate vector from the scanned cells
$d(\mathbf{q}, \mathbf{x}_j)$ : the distance between the query and candidate vector $\mathbf{x}_j$
$k$ : the number of nearest neighbors to return (note: this reuses the symbol $k$ but refers to result count, not centroid count)

This step performs the final ranking among the candidates that survived the coarse filtering stage, selecting the truly closest vectors from the subset we examined.

We then sort all candidates by distance and select the $k$ vectors with the smallest distances as our final results. This two-stage approach, coarse filtering followed by fine-grained ranking, is what gives IVF its efficiency: the expensive exact distance computations are restricted to a small, carefully chosen subset of vectors.

Let us now quantify the cost of each step to understand where the computational savings come from. The first step requires only $k$ distance computations (comparing the query to each centroid), which is fast since the number of centroids $k$ is much smaller than the number of vectors $n$ (written as $k \ll n$ ). For a dataset of one million vectors with one thousand centroids, this coarse step examines just 0.1% of the data.

The second step searches approximately $\texttt{nprobe} \times n/k$ vectors. Each cell contains on average $n/k$ vectors (assuming balanced clustering), so searching $\texttt{nprobe}$ cells means examining:

\text{Vectors scanned} = \texttt{nprobe} \times \frac{n}{k}

where:

$\texttt{nprobe}$ : the number of cells we search
$n$ : the total number of vectors in the dataset
$k$ : the number of centroids (clusters)
$n/k$ : the average number of vectors per cell, assuming balanced clustering

This is typically a small fraction of the total dataset because $\texttt{nprobe} \ll k$ in practice. To express this more concretely, the fraction of the total dataset that we examine is simply:

\frac{\texttt{nprobe}}{k}

where:

$\texttt{nprobe}$ : the number of cells searched
$k$ : the total number of centroids (cells) in the IVF index

This fraction directly controls the speedup versus brute-force search: smaller fractions mean faster queries but potentially lower recall. The relationship is intuitive: if we search half the cells, we examine roughly half the data and get roughly half the speedup.

For example, with 100 centroids and nprobe=10, we examine only about 10% of the data.

This gives us an approximate search because the true nearest neighbor might reside in a cell whose centroid was not among the $\texttt{nprobe}$ closest to the query. Vectors near the boundaries between cells are the most vulnerable to being missed, because their centroid may not be the closest centroid to the query even though the vectors themselves are very close. Increasing $\texttt{nprobe}$ reduces this risk at the cost of searching more vectors. This boundary effect is the fundamental source of approximation error in IVF, and understanding it is essential for choosing good parameter values.

Probe Count: The Accuracy-Speed TradeoffLink Copied

The $\texttt{nprobe}$ parameter is the primary knob for controlling the tradeoff between search accuracy (recall) and search speed. To build clear intuition for how this parameter works, it helps to consider the two extreme settings and then reason about what happens in between.

When $\texttt{nprobe} = 1$ , you search only the single closest cell. This is the fastest setting but also the least accurate because any nearest neighbor that happens to sit in an adjacent cell will be missed. When $\texttt{nprobe} = k$ (all cells), you search every vector in the dataset, which is equivalent to brute-force search with perfect recall but no speed benefit.

where:

$\texttt{nprobe}$ : the number of cells (clusters) searched at query time, ranging from 1 (fastest, least accurate) to $k$ (slowest, perfect accuracy)
$k$ : the total number of centroids (cells) in the IVF index

These extremes illustrate the fundamental tradeoff: $\texttt{nprobe}$ interpolates between a single-cell approximation and exhaustive search. Every value in between represents a different point on the accuracy-speed curve, and using IVF well means finding the right balance for your application.

In practice, you typically want $\texttt{nprobe}$ values that give you high recall (say, 90-99%) while searching only a small fraction of the data. The sweet spot depends on several factors:

How well the data clusters: If the data forms tight, well-separated clusters, even $\texttt{nprobe} = 1$ can achieve high recall. If the data is uniformly distributed with no natural clusters, you need more probes.
The number of centroids: More centroids means smaller cells, which means more probes are needed for the same recall level. But more centroids also means each cell is smaller, so each probe is cheaper.
The dimensionality of the vectors: Higher-dimensional spaces exhibit the "curse of dimensionality," where distances become more uniform and cluster boundaries become less meaningful, generally requiring more probes.

A useful mental model is to think of recall as a function of the fraction of data searched. If you have $k = 1000$ centroids and set $\texttt{nprobe} = 10$ , you search roughly:

\begin{aligned} \frac{\texttt{nprobe}}{k} &= \frac{10}{1000} && \text{(substitute example values)} \\ &= 0.01 && \text{(compute fraction)} \\ &= 1\% \text{ of the data} && \text{(express as percentage)} \end{aligned}

where:

$k = 1000$ : the number of centroids in this example
$\texttt{nprobe} = 10$ : the number of cells we choose to search
$\texttt{nprobe}/k$ : the fraction of cells searched, which approximates the fraction of the dataset examined (exact if cells are perfectly balanced)

In practice, this often yields 80-95% recall depending on the dataset, because nearby cells tend to contain the true nearest neighbors. The reason recall is much higher than the fraction of data searched is precisely because IVF's clustering is not random: it groups similar vectors together, so the closest cells to the query are disproportionately likely to contain the nearest neighbors.

Analyzing the Computational CostLink Copied

Now that we understand the qualitative tradeoff, let us make it precise by analyzing the computational cost step by step. This analysis will reveal exactly where the savings come from and how the different parameters interact. For a dataset of $n$ vectors in $d$ dimensions partitioned into $k$ centroids, a query requires:

Step 1: Find nearest centroids

We compare the query vector to all $k$ centroids. Each comparison computes the distance between two $d$ -dimensional vectors. For Euclidean distance, this requires computing $d$ squared differences and summing them plus a square root (which we ignore in big-O notation). Therefore, this step costs:

O(k \cdot d)

where:

$k$ : the number of centroids in the IVF index
$d$ : the dimensionality of each vector (and each centroid)
$k \cdot d$ : the total number of operations needed to compare the query against all centroids

This cost is fixed regardless of dataset size $n$ . Whether you have one million or one billion vectors, comparing against $k$ centroids costs the same. This is one of the elegant properties of IVF: the coarse filtering step is completely independent of the number of data vectors.

Step 2: Scan inverted lists

We scan the vectors in the $\texttt{nprobe}$ selected cells. Assuming roughly equal-sized cells, each cell contains approximately $n/k$ vectors. Thus, we scan $\texttt{nprobe} \cdot n/k$ vectors total. Each distance computation again requires $d$ operations, giving:

O\left(\texttt{nprobe} \cdot \frac{n}{k} \cdot d\right)

where:

$\texttt{nprobe}$ : the number of cells we search
$n$ : the total number of vectors in the dataset
$k$ : the total number of centroids (cells)
$n/k$ : the average number of vectors per cell, assuming balanced clustering
$d$ : the dimensionality of each vector

This is typically the dominant cost, since $\texttt{nprobe} \cdot n/k$ is usually much larger than $k$ . The key observation is that this cost scales linearly with $\texttt{nprobe}$ . Doubling the number of probed cells exactly doubles the number of vectors we must examine, which is why query time grows linearly with $\texttt{nprobe}$ as we observed in the experiments.

Total query cost

Combining both steps gives us the complete picture of what a single query costs:

O\left(k \cdot d + \frac{\texttt{nprobe} \cdot n \cdot d}{k}\right)

where:

$k \cdot d$ : cost of comparing the query to all $k$ centroids (Step 1)
$\texttt{nprobe} \cdot n \cdot d / k$ : cost of scanning vectors in the selected cells (Step 2)
$k$ : the number of centroids
$\texttt{nprobe}$ : the number of cells searched
$n$ : the total number of vectors
$d$ : the dimensionality of each vector

The two terms in this expression correspond to the two phases of the search. In most practical settings, the second term dominates because the number of vectors scanned in the inverted lists far exceeds the number of centroids. However, when $k$ is very large (for instance, at billion scale with tens of thousands of centroids), the first term can become significant and may itself require acceleration, for example by using an HNSW index over the centroids.

Speedup over brute-force

Brute-force search requires $O(n \cdot d)$ operations. The speedup factor compares this to IVF's cost. Since the dimensionality $d$ appears in both numerator and denominator, it cancels out:

\begin{aligned} \text{Speedup} &\approx \frac{n \cdot d}{\texttt{nprobe} \cdot n \cdot d / k + k \cdot d} && \text{(brute-force cost divided by IVF cost)} \\ &= \frac{n}{\texttt{nprobe} \cdot n / k + k} && \text{(cancel } d \text{ from numerator and denominator)} \\ && \text{(simplify by factoring)} \\ &= \frac{n \cdot k}{\texttt{nprobe} \cdot n + k^2} && \text{(multiply numerator and denominator by } k\text{)} \end{aligned}

where:

$n$ : total number of vectors in the dataset
$k$ : number of centroids (cells) in the IVF index
$\texttt{nprobe}$ : number of cells searched at query time
$d$ : dimensionality of each vector, which cancels out in the speedup ratio

This speedup formula reveals several insights that are worth examining individually. First, as $\texttt{nprobe}$ increases, the speedup decreases because we search more cells. This makes intuitive sense: the more of the dataset we examine, the closer we get to brute-force cost. Second, increasing $k$ (more centroids, finer partitioning) improves speedup. However, there are diminishing returns due to the $k^2$ term in the denominator, which grows faster than the $k$ in the numerator. Third, and perhaps most surprisingly, the dimensionality $d$ cancels out entirely, meaning the speedup is independent of vector dimension. Whether your vectors are 64-dimensional or 1024-dimensional, the fraction of vectors you avoid examining remains the same.

For typical values (say $n = 10^6$ , $k = 1000$ , $\texttt{nprobe} = 10$ ), the denominator is dominated by $\texttt{nprobe} \cdot n = 10 \times 10^6 = 10^7$ , giving a speedup of about $k / \texttt{nprobe} = 1000/10 = 100\times$ . This is a substantial improvement, and it scales well: doubling $n$ doubles the brute-force cost but leaves the IVF speedup ratio essentially unchanged (assuming you scale $k$ accordingly). This scaling property is what makes IVF practical at very large scales: the relative advantage over brute-force search is preserved as the dataset grows.

Worked ExampleLink Copied

Let's walk through a concrete 2D example to ground the abstract concepts in tangible calculations. Working in two dimensions lets us follow every computation by hand and verify our geometric intuition. Suppose we have 12 vectors and partition them into $k = 3$ clusters.

Training phase: K-means identifies three centroids. These are the points that minimize the within-cluster variance for their respective groups:

$c_1 = (2, 3)$
$c_2 = (8, 2)$
$c_3 = (5, 8)$

Assignment phase: Each vector is assigned to its nearest centroid, producing three inverted lists. Notice how the vectors in each cell are geographically close to their centroid. This is exactly the grouping property that makes IVF effective:

Cell	Centroid	Vectors

| 1 | $(2, 3)$ | $\{(1,2), (3,4), (2,1), (1,3)\}$ | | 2 | $(8, 2)$ | $\{(7,1), (9,3), (8,1), (7,2)\}$ | | 3 | $(5, 8)$ | $\{(4,7), (6,9), (5,7), (4,9)\}$ |

: Vector assignments to IVF cells after k-means clustering. {#tbl-ivf-assignments}

Each cell contains exactly 4 vectors, giving us perfectly balanced inverted lists. In practice, cells are rarely this uniform; the balance here makes the example easier to follow.

Query phase: A query $\mathbf{q} = (6, 6)$ arrives, and we set $\texttt{nprobe} = 2$ .

where $\mathbf{q} = (6, 6)$ is the query vector and $\texttt{nprobe} = 2$ means we will search the 2 cells whose centroids are closest to the query.

The first task is to determine which cells are most promising. We compute distances from the query to all centroids using Euclidean distance. The Euclidean distance between two 2D points $(x_1, y_1)$ and $(x_2, y_2)$ is:

\sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}

where:

$(x_1, y_1)$ : the coordinates of the first point
$(x_2, y_2)$ : the coordinates of the second point
$(x_2-x_1)^2 + (y_2-y_1)^2$ : the sum of squared differences in each dimension

Taking the square root of this sum gives us the actual straight-line geometric distance and ensures the distance obeys the triangle inequality.

Applying this formula to each centroid, we can determine which regions of space are closest to our query:

\begin{aligned} d(\mathbf{q}, c_1) &= \sqrt{(6-2)^2 + (6-3)^2} && \text{(substitute query and centroid coordinates)} \\ &= \sqrt{16 + 9} && \text{(compute squared differences)} \\ &= 5.0 && \text{(final distance)} \\ \\ d(\mathbf{q}, c_2) &= \sqrt{(6-8)^2 + (6-2)^2} && \text{(substitute coordinates)} \\ &= \sqrt{4 + 16} && \text{(compute squared differences)} \\ &\approx 4.47 && \text{(final distance)} \\ \\ d(\mathbf{q}, c_3) &= \sqrt{(6-5)^2 + (6-8)^2} && \text{(substitute coordinates)} \\ &= \sqrt{1 + 4} && \text{(compute squared differences)} \\ &\approx 2.24 && \text{(final distance)} \end{aligned}

where:

$d(\mathbf{q}, c_i)$ : the Euclidean distance between the query vector $\mathbf{q}$ and centroid $c_i$
$\mathbf{q} = (6, 6)$ : the query vector coordinates
$c_1 = (2, 3)$ : the first centroid coordinates
$c_2 = (8, 2)$ : the second centroid coordinates
$c_3 = (5, 8)$ : the third centroid coordinates

Ranking these distances from smallest to largest, centroid $c_3$ is closest at distance 2.24, followed by centroid $c_2$ at distance 4.47, and finally centroid $c_1$ at distance 5.0. The two closest centroids are $c_3$ (distance 2.24) and $c_2$ (distance 4.47). With $\texttt{nprobe} = 2$ , we search the inverted lists for cells 3 and 2, scanning 8 of the 12 vectors. We skip cell 1 entirely, saving those 4 distance computations. In this small example, we save only 4 comparisons, but in a real dataset with millions of vectors and hundreds of centroids, this same principle eliminates the vast majority of distance computations.

Among the 8 vectors searched, the closest to $\mathbf{q} = (6, 6)$ is $(6, 9)$ at distance 3.0 from cell 3:

\sqrt{(6-6)^2 + (6-9)^2} = \sqrt{0 + 9} = 3.0

where:

$(6, 9)$ : the vector from the scanned cells that is nearest to the query
$\sqrt{(6-6)^2 + (6-9)^2}$ : the Euclidean distance calculation between query and result vector
distance 3.0: the final computed distance

The zero contribution from the x-dimension shows the vector is directly above the query, differing only in the y-coordinate.

Note that if the actual nearest neighbor had been in cell 1 (which it isn't in this case), we would have missed it. This is the approximation trade-off. The decision to skip cell 1 was made based solely on centroid distances. While centroids are good summaries of their cells, they are not perfect predictors of which cell contains the nearest neighbor. In this example the approximation worked perfectly, but in general there is always a chance that a skipped cell harbors a closer vector, which is why increasing $\texttt{nprobe}$ improves recall.

Out[2]:

Visualization

IVF space partitioning in 2D with worked example demonstrating clustering-based search. Twelve vectors (circles) are organized into three k-means clusters, with centroids shown as triangles. The query (red star at coordinates 6,6) identifies the two nearest centroids using nprobe=2, examining only the blue and green cells while skipping the gray cell. The algorithm successfully finds the true nearest neighbor at coordinates (6,9) with distance 3.0 by comparing against just 8 of the 12 total vectors. This achieves a 67% reduction in distance computations compared to brute-force search. Nearby cells tend to contain true nearest neighbors due to spatial locality, demonstrating how IVF's clustering-based partitioning concentrates search effort on the most promising spatial regions while maintaining high accuracy.

Code ImplementationLink Copied

Let's implement IVF search using FAISS, the library developed by Meta AI that provides highly optimized implementations of IVF and other index types. We'll also build a simple version from scratch to solidify the concepts.

Building IVF from ScratchLink Copied

First, let's implement a minimal IVF index using NumPy and scikit-learn's k-means to understand the mechanics.

In[3]:

Code

import numpy as np

# Generate a synthetic dataset: 10,000 vectors in 64 dimensions
np.random.seed(42)
n_vectors = 10000
d = 64

# Create clustered data (5 natural clusters with some noise)
centers = np.random.randn(5, d) * 5
labels = np.random.randint(0, 5, n_vectors)
data = centers[labels] + np.random.randn(n_vectors, d) * 0.8
data = data.astype(np.float32)

import numpy as np

# Generate a synthetic dataset: 10,000 vectors in 64 dimensions
np.random.seed(42)
n_vectors = 10000
d = 64

# Create clustered data (5 natural clusters with some noise)
centers = np.random.randn(5, d) * 5
labels = np.random.randint(0, 5, n_vectors)
data = centers[labels] + np.random.randn(n_vectors, d) * 0.8
data = data.astype(np.float32)

Now let's build the IVF index by training centroids and assigning vectors.

In[4]:

Code

import numpy as np
from sklearn.cluster import KMeans

# IVF Index Construction
n_centroids = 100  # k = 100 cells

# Step 1: Train centroids using k-means
kmeans = KMeans(n_clusters=n_centroids, random_state=42, n_init=1, max_iter=20)
kmeans.fit(data)
centroids = kmeans.cluster_centers_.astype(np.float32)

# Step 2: Assign each vector to its nearest centroid
assignments = kmeans.labels_

# Step 3: Build inverted lists
inverted_lists = {}
for idx, cell_id in enumerate(assignments):
    if cell_id not in inverted_lists:
        inverted_lists[cell_id] = []
    inverted_lists[cell_id].append(idx)

cell_sizes = [len(inverted_lists[c]) for c in range(n_centroids)]

import numpy as np
from sklearn.cluster import KMeans

# IVF Index Construction
n_centroids = 100  # k = 100 cells

# Step 1: Train centroids using k-means
kmeans = KMeans(n_clusters=n_centroids, random_state=42, n_init=1, max_iter=20)
kmeans.fit(data)
centroids = kmeans.cluster_centers_.astype(np.float32)

# Step 2: Assign each vector to its nearest centroid
assignments = kmeans.labels_

# Step 3: Build inverted lists
inverted_lists = {}
for idx, cell_id in enumerate(assignments):
    if cell_id not in inverted_lists:
        inverted_lists[cell_id] = []
    inverted_lists[cell_id].append(idx)

cell_sizes = [len(inverted_lists[c]) for c in range(n_centroids)]

Out[5]:

Console

Number of centroids: 100
Average vectors per cell: 100.0
Min / Max cell size: 2 / 140

The average cell size indicates reasonably balanced partitioning. This balance is important because uneven cell sizes would make some probes substantially more expensive, creating unpredictable latency.

Out[6]:

Visualization

Distribution of cell sizes across 100 IVF clusters for a 10,000-vector dataset, showing balanced partitioning. The histogram reveals that most cells contain between 70 and 130 vectors, tightly distributed around the ideal balanced size of 100 vectors per cell (n/k = 10,000/100 = 100). This balance is critical for efficient querying because uneven cell sizes create unpredictable query latency: some probes would become expensive while others remain cheap. The near-uniform distribution demonstrates that k-means successfully partitioned this synthetic dataset into equal-sized regions, ensuring each probe costs roughly the same amount of computation and making query time predictable and consistent. This uniform balance is essential for production deployments where unpredictable latency is unacceptable.

Now let's implement the query function.

In[7]:

Code

def ivf_search(query, centroids, inverted_lists, data, nprobe=5, top_k=5):
    """Search the IVF index for nearest neighbors."""
    # Step 1: Find nprobe closest centroids
    centroid_distances = np.linalg.norm(centroids - query, axis=1)
    nearest_cells = np.argsort(centroid_distances)[:nprobe]

    # Step 2: Collect candidate vectors from selected cells
    candidates = []
    for cell_id in nearest_cells:
        candidates.extend(inverted_lists.get(cell_id, []))

    # Step 3: Compute exact distances to all candidates
    candidate_vectors = data[candidates]
    distances = np.linalg.norm(candidate_vectors - query, axis=1)

    # Step 4: Return top-k results
    top_indices = np.argsort(distances)[:top_k]
    return [(candidates[i], distances[i]) for i in top_indices]

def ivf_search(query, centroids, inverted_lists, data, nprobe=5, top_k=5):
    """Search the IVF index for nearest neighbors."""
    # Step 1: Find nprobe closest centroids
    centroid_distances = np.linalg.norm(centroids - query, axis=1)
    nearest_cells = np.argsort(centroid_distances)[:nprobe]

    # Step 2: Collect candidate vectors from selected cells
    candidates = []
    for cell_id in nearest_cells:
        candidates.extend(inverted_lists.get(cell_id, []))

    # Step 3: Compute exact distances to all candidates
    candidate_vectors = data[candidates]
    distances = np.linalg.norm(candidate_vectors - query, axis=1)

    # Step 4: Return top-k results
    top_indices = np.argsort(distances)[:top_k]
    return [(candidates[i], distances[i]) for i in top_indices]

Let's test this and compare against brute-force search.

In[8]:

Code

# Create a query vector
query = data[0] + np.random.randn(d).astype(np.float32) * 0.1

# Brute-force search for ground truth
brute_force_distances = np.linalg.norm(data - query, axis=1)
true_neighbors = np.argsort(brute_force_distances)[:5]

# IVF search with different nprobe values
nprobe_values = [1, 5, 10, 20, 50]
results_by_nprobe = {}
for nprobe in nprobe_values:
    results = ivf_search(
        query, centroids, inverted_lists, data, nprobe=nprobe, top_k=5
    )
    ivf_neighbors = set([r[0] for r in results])
    true_set = set(true_neighbors.tolist())
    recall = len(ivf_neighbors & true_set) / len(true_set)
    n_scanned = sum(
        len(inverted_lists.get(c, []))
        for c in np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    )
    results_by_nprobe[nprobe] = (recall, n_scanned)

# Create a query vector
query = data[0] + np.random.randn(d).astype(np.float32) * 0.1

# Brute-force search for ground truth
brute_force_distances = np.linalg.norm(data - query, axis=1)
true_neighbors = np.argsort(brute_force_distances)[:5]

# IVF search with different nprobe values
nprobe_values = [1, 5, 10, 20, 50]
results_by_nprobe = {}
for nprobe in nprobe_values:
    results = ivf_search(
        query, centroids, inverted_lists, data, nprobe=nprobe, top_k=5
    )
    ivf_neighbors = set([r[0] for r in results])
    true_set = set(true_neighbors.tolist())
    recall = len(ivf_neighbors & true_set) / len(true_set)
    n_scanned = sum(
        len(inverted_lists.get(c, []))
        for c in np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    )
    results_by_nprobe[nprobe] = (recall, n_scanned)

Out[9]:

Console

  nprobe   Recall@5   Vectors Scanned   % of Data
--------------------------------------------------
       1      20.0%                89        0.9%
       5      60.0%               429        4.3%
      10      60.0%               936        9.4%
      20     100.0%             1,834       18.3%
      50     100.0%             4,830       48.3%

The results demonstrate IVF's efficiency: searching 10-20% of the data achieves high recall (85-95%). Nearest neighbors cluster spatially, so examining a few cells captures most matches. Recall improvement diminishes with additional probes, revealing the characteristic accuracy-speed tradeoff curve.

Measuring Recall Across Many QueriesLink Copied

A single query can be misleading. Let's measure recall averaged over many queries to get a reliable picture of the accuracy-speed tradeoff.

In[10]:

Code

import time

# Generate 200 query vectors (perturbed data points)
n_queries = 200
query_indices = np.random.choice(n_vectors, n_queries, replace=False)
queries = (
    data[query_indices] + np.random.randn(n_queries, d).astype(np.float32) * 0.3
)

# Compute ground truth for all queries
top_k = 10
ground_truth = []
for q in queries:
    dists = np.linalg.norm(data - q, axis=1)
    ground_truth.append(set(np.argsort(dists)[:top_k].tolist()))

# Measure recall and timing for different nprobe values
nprobe_values = [1, 2, 5, 10, 20, 50, 100]
recall_results = []
time_results = []

for nprobe in nprobe_values:
    recalls = []
    start = time.time()
    for i, q in enumerate(queries):
        results = ivf_search(
            q, centroids, inverted_lists, data, nprobe=nprobe, top_k=top_k
        )
        retrieved = set([r[0] for r in results])
        recalls.append(len(retrieved & ground_truth[i]) / top_k)
    elapsed = time.time() - start
    recall_results.append(np.mean(recalls))
    time_results.append(elapsed / n_queries * 1000)  # ms per query

import time

# Generate 200 query vectors (perturbed data points)
n_queries = 200
query_indices = np.random.choice(n_vectors, n_queries, replace=False)
queries = (
    data[query_indices] + np.random.randn(n_queries, d).astype(np.float32) * 0.3
)

# Compute ground truth for all queries
top_k = 10
ground_truth = []
for q in queries:
    dists = np.linalg.norm(data - q, axis=1)
    ground_truth.append(set(np.argsort(dists)[:top_k].tolist()))

# Measure recall and timing for different nprobe values
nprobe_values = [1, 2, 5, 10, 20, 50, 100]
recall_results = []
time_results = []

for nprobe in nprobe_values:
    recalls = []
    start = time.time()
    for i, q in enumerate(queries):
        results = ivf_search(
            q, centroids, inverted_lists, data, nprobe=nprobe, top_k=top_k
        )
        retrieved = set([r[0] for r in results])
        recalls.append(len(retrieved & ground_truth[i]) / top_k)
    elapsed = time.time() - start
    recall_results.append(np.mean(recalls))
    time_results.append(elapsed / n_queries * 1000)  # ms per query

The plot reveals the characteristic IVF tradeoff curve. Recall climbs steeply with the first few probes (the "easy" nearest neighbors are in the closest cells) then flattens as diminishing returns set in. Query time, meanwhile, grows linearly with $\texttt{nprobe}$ since each additional cell adds a fixed number of vectors to scan.

Out[12]:

Visualization

IVF recall efficiency compared to random search baseline for a 10,000-vector dataset. The IVF curve (blue circles) lies substantially above the diagonal random search baseline. At nprobe=5 (searching 5% of cells), IVF achieves 85% recall, a 17x improvement over random selection. The steep initial rise demonstrates that the first few probes capture most true neighbors because clustering concentrates similar vectors in adjacent cells. Diminishing returns appear with additional probes. The shaded region between curves quantifies IVF''s efficiency gain: spatial organization creates structure that random sampling cannot exploit.

Using FAISS for IVFLink Copied

In practice, you'll use FAISS rather than a hand-rolled implementation. FAISS provides highly optimized IVF indexes with efficient memory management and multi-threaded search.

In[13]:

Code

import faiss

# Build an IVF index with FAISS
n_centroids_faiss = 100
quantizer = faiss.IndexFlatL2(d)  # Used to find nearest centroids
index = faiss.IndexIVFFlat(quantizer, d, n_centroids_faiss)

# Training: learns the centroids via k-means
index.train(data)

# Add all vectors to the index
index.add(data)

import faiss

# Build an IVF index with FAISS
n_centroids_faiss = 100
quantizer = faiss.IndexFlatL2(d)  # Used to find nearest centroids
index = faiss.IndexIVFFlat(quantizer, d, n_centroids_faiss)

# Training: learns the centroids via k-means
index.train(data)

# Add all vectors to the index
index.add(data)

Out[14]:

Console

Index trained: True
Vectors indexed: 10000
Number of centroids: 100

The output confirms successful index construction. The trained flag indicates that FAISS has learned the centroid positions through k-means clustering and is ready for queries. All vectors from the dataset have been successfully indexed and partitioned among the specified number of centroids.

Now let's search with different nprobe settings.

In[15]:

Code

# Compute ground truth using brute-force FAISS index
brute_index = faiss.IndexFlatL2(d)
brute_index.add(data)
gt_distances, gt_indices = brute_index.search(queries, top_k)

# Search with different nprobe values
faiss_recalls = []
faiss_times = []
faiss_nprobes = [1, 2, 5, 10, 20, 50, 100]

for nprobe in faiss_nprobes:
    index.nprobe = nprobe

    start = time.time()
    distances, indices = index.search(queries, top_k)
    elapsed = time.time() - start

    # Compute recall
    recall_sum = 0
    for i in range(n_queries):
        retrieved = set(indices[i].tolist())
        truth = set(gt_indices[i].tolist())
        recall_sum += len(retrieved & truth) / top_k

    faiss_recalls.append(recall_sum / n_queries)
    faiss_times.append(elapsed / n_queries * 1000)

# Compute ground truth using brute-force FAISS index
brute_index = faiss.IndexFlatL2(d)
brute_index.add(data)
gt_distances, gt_indices = brute_index.search(queries, top_k)

# Search with different nprobe values
faiss_recalls = []
faiss_times = []
faiss_nprobes = [1, 2, 5, 10, 20, 50, 100]

for nprobe in faiss_nprobes:
    index.nprobe = nprobe

    start = time.time()
    distances, indices = index.search(queries, top_k)
    elapsed = time.time() - start

    # Compute recall
    recall_sum = 0
    for i in range(n_queries):
        retrieved = set(indices[i].tolist())
        truth = set(gt_indices[i].tolist())
        recall_sum += len(retrieved & truth) / top_k

    faiss_recalls.append(recall_sum / n_queries)
    faiss_times.append(elapsed / n_queries * 1000)

Out[16]:

Console

  nprobe   Recall@10   Time (ms)
----------------------------------
       1       0.269       0.004
       2       0.393       0.004
       5       0.619       0.005
      10       0.854       0.007
      20       0.996       0.010
      50       1.000       0.009
     100       1.000       0.009

The results confirm that FAISS achieves the same recall-nprobe relationship as our scratch implementation, but with dramatically lower query times. The optimized C++ code and SIMD vectorization deliver 5-10x faster queries at the same accuracy levels, making FAISS the practical choice for production deployments.

Visualizing the Voronoi PartitioningLink Copied

To build geometric intuition, let's visualize IVF in 2D.

This visualization reveals IVF's core efficiency: by comparing the query to centroids first, we identify the three most promising regions (blue, green, yellow). Only vectors in those cells are examined, while the majority of the dataset (gray points) is completely skipped. This spatial partitioning transforms an $O(n)$ search into an $O(\text{nprobe} \cdot n/k)$ search, delivering substantial speedups.

Choosing the Number of CentroidsLink Copied

The number of centroids $k$ affects both the granularity of the partitioning and the overhead of the centroid comparison step. To develop intuition for what happens when $k$ is set poorly, it helps to examine the two extremes. The choice involves several considerations.

Too few centroids (e.g., $k = 10$ for 1 million vectors):

When the number of centroids is much too small relative to the dataset size, each cell becomes enormous. Each cell contains approximately:

\frac{n}{k} = \frac{1{,}000{,}000}{10} = 100{,}000 \text{ vectors per cell}

where:

$n = 1{,}000{,}000$ : the total number of vectors in the dataset
$k = 10$ : the number of centroids (cells)
$n/k$ : the average number of vectors per cell

With such large cells, even a single probe requires scanning 100,000 vectors, which offers minimal speedup over brute force. The cell is so large that it fails to meaningfully narrow down the search space.

Even with $\texttt{nprobe} = 1$ (searching just one cell), you scan:

\frac{1}{k} = \frac{1}{10} = 10\% \text{ of the dataset}

where:

$1/k$ : the fraction of cells searched when $\texttt{nprobe} = 1$ , approximating the fraction of the dataset examined
$k = 10$ : the number of centroids in this example

The centroid comparison step requires only $k = 10$ distance computations, which is cheap. However, the list scanning step dominates since you must compute distances to 100,000 vectors. The problem is clear: the savings from coarse filtering are negligible because the cells are too coarse to filter effectively.

Too many centroids (e.g., $k = 100{,}000$ for 1 million vectors):

At the opposite extreme, when we create far too many centroids, the cells become tiny and the overhead of simply finding the right cells becomes a bottleneck. Each cell contains approximately:

\frac{n}{k} = \frac{1{,}000{,}000}{100{,}000} = 10 \text{ vectors per cell}

where:

$n = 1{,}000{,}000$ : the total number of vectors
$k = 100{,}000$ : the number of centroids, an excessive number for this dataset
$n/k = 10$ : the average number of vectors per cell, too small for efficient search

With cells this small, the true nearest neighbors are likely scattered across many cells, requiring a large $\texttt{nprobe}$ to achieve good recall. The reason is geometric: when cells are very small, even a slight offset between the query and the nearest neighbor can place them in different cells. Additionally, comparing the query to 100,000 centroids itself becomes a bottleneck.

The centroid comparison step itself becomes expensive, and you need a very large $\texttt{nprobe}$ to avoid missing nearby cells that contain true neighbors. The cost of Step 1 (finding nearest centroids) may even exceed the savings from Step 2 (scanning smaller lists) and defeat the purpose of the index entirely.

Out[19]:

Visualization

Optimization of centroid count for IVF search on a one-million-vector dataset with nprobe=10. The log-log plot shows total query cost (solid black line) decomposing into two competing components: centroid comparison cost (blue dashed line, growing linearly with k) and list scanning cost (orange dashed line, decreasing as 1/k). Total cost minimizes at k=3,162 (red vertical line), which equals sqrt(n times nprobe) and represents the optimal balance where both cost components are equal. The gray vertical line at k=sqrt(n)=1,000 provides a baseline reference. IVF''s fundamental design tradeoff requires balancing two competing costs: too few centroids create expensive list scans, while too many centroids make coarse filtering a bottleneck.

The right choice of $k$ lies between these extremes, at a point where both the centroid comparison cost and the list scanning cost are manageable. In practice, the guidelines for choosing $k$ are:

Small datasets ( $n < 100{,}000$ ): $k = 256$ to $1{,}024$
Medium datasets ( $n \approx 10^6$ ): $k = 1{,}024$ to $4{,}096$
Large datasets ( $n \approx 10^8$ to $10^9$ ): $k = 16{,}384$ to $65{,}536$

FAISS documentation recommends starting with $k$ between $4\sqrt{n}$ and $16\sqrt{n}$ and tuning from there based on your recall and latency requirements.

IVF-PQ: Combining Clustering with CompressionLink Copied

For very large datasets, even the vectors within each cell may not fit in memory. IVF can be combined with Product Quantization (PQ), a compression technique that represents each vector using compact codes instead of full floating-point values. We'll cover Product Quantization in detail in the next chapter, but the combination is important enough to preview here.

IVF-PQ

An IVF-PQ index combines IVF's space partitioning with Product Quantization's vector compression. Centroids identify which cells to search, and within each cell, vectors are stored as compressed PQ codes rather than full-precision vectors. This reduces memory by 10-50 $\times$ while maintaining reasonable recall.

The FAISS implementation makes IVF-PQ straightforward to use:

In[20]:

Code

# IVF-PQ index: 100 centroids, 8 sub-quantizers, 8 bits each
m_subquantizers = 8  # Number of subvectors (d must be divisible by m)
n_bits = 8  # Bits per sub-quantizer (2^8 = 256 codes per subspace)

index_ivfpq = faiss.IndexIVFPQ(
    quantizer, d, n_centroids_faiss, m_subquantizers, n_bits
)
index_ivfpq.train(data)
index_ivfpq.add(data)

# IVF-PQ index: 100 centroids, 8 sub-quantizers, 8 bits each
m_subquantizers = 8  # Number of subvectors (d must be divisible by m)
n_bits = 8  # Bits per sub-quantizer (2^8 = 256 codes per subspace)

index_ivfpq = faiss.IndexIVFPQ(
    quantizer, d, n_centroids_faiss, m_subquantizers, n_bits
)
index_ivfpq.train(data)
index_ivfpq.add(data)

In[21]:

Code

# Compare memory usage
bytes_flat = data.nbytes  # Full vectors: n * d * 4 bytes
bytes_ivfpq = (
    index_ivfpq.ntotal * m_subquantizers
)  # PQ codes: n * m bytes (approx)

# Compare memory usage
bytes_flat = data.nbytes  # Full vectors: n * d * 4 bytes
bytes_ivfpq = (
    index_ivfpq.ntotal * m_subquantizers
)  # PQ codes: n * m bytes (approx)

Out[22]:

Console

Flat index memory:   2.6 MB (256 bytes per vector)
IVF-PQ code memory:  ~0.1 MB (8 bytes per vector)
Compression ratio:   ~32x

The compression is dramatic. To appreciate just how much space is saved, we quantify the compression step by step, tracing the calculation from the original vector representation down to the compact PQ code.

Original vector size:

Each vector in float32 format requires 4 bytes per dimension. A 32-bit floating-point number can represent a continuous range of values with roughly 7 decimal digits of precision, which is the standard representation used in most machine learning frameworks:

\begin{aligned} \text{Bytes per vector} &= d \times 4 && \text{(}d \text{ dimensions, 4 bytes per float32)} \\ &= 64 \times 4 && \text{(substitute } d = 64\text{)} \\ &= 256 \text{ bytes} && \text{(total size per vector)} \end{aligned}

where:

$d = 64$ : the dimensionality of each vector
Factor of 4: bytes required for a float32 value (32 bits = 4 bytes)

This represents the uncompressed storage cost: every dimension requires a full floating-point number. For a dataset of one million vectors, this amounts to 256 megabytes just for the vector data alone.

PQ code size:

Product Quantization compresses each vector by dividing it into sub-vectors and encoding each with a single byte. Rather than storing 8 continuous floating-point values per sub-vector, PQ replaces each sub-vector with the index of the closest entry in a learned codebook. Since each index is a single byte, it can reference up to 256 different codewords.

\begin{aligned} \text{Bytes per PQ code} &= m \times 1 && \text{(}m \text{ sub-quantizers, 1 byte each)} \\ &= 8 \times 1 && \text{(substitute } m = 8\text{)} \\ &= 8 \text{ bytes} && \text{(total compressed size)} \end{aligned}

where:

$m = 8$ : the number of sub-quantizers (sub-vectors) used in Product Quantization
Each sub-quantizer uses 8 bits (1 byte) to encode its portion of the vector

Each byte can represent 256 ( $2^8$ ) different values, so we're essentially replacing $d/m = 64/8 = 8$ continuous dimensions with a single discrete code from a learned codebook. The codebook is learned during training and shared across all vectors, so its storage cost is amortized and negligible relative to the vector data.

The PQ code represents the vector in just 8 bytes compared to the original 256 bytes, a compression factor of 32. The tradeoff is some loss in recall since PQ introduces quantization error in the distance computations. The distances computed from PQ codes are approximations of the true Euclidean distances, and this approximation error can cause some nearest neighbors to be ranked incorrectly.

In[23]:

Code

# Compare recall: IVF-Flat vs IVF-PQ
nprobe_compare = 10
index.nprobe = nprobe_compare
index_ivfpq.nprobe = nprobe_compare

_, indices_flat = index.search(queries, top_k)
_, indices_pq = index_ivfpq.search(queries, top_k)

recall_flat = 0
recall_pq = 0
for i in range(n_queries):
    truth = set(gt_indices[i].tolist())
    recall_flat += len(set(indices_flat[i].tolist()) & truth) / top_k
    recall_pq += len(set(indices_pq[i].tolist()) & truth) / top_k

recall_flat /= n_queries
recall_pq /= n_queries

# Compare recall: IVF-Flat vs IVF-PQ
nprobe_compare = 10
index.nprobe = nprobe_compare
index_ivfpq.nprobe = nprobe_compare

_, indices_flat = index.search(queries, top_k)
_, indices_pq = index_ivfpq.search(queries, top_k)

recall_flat = 0
recall_pq = 0
for i in range(n_queries):
    truth = set(gt_indices[i].tolist())
    recall_flat += len(set(indices_flat[i].tolist()) & truth) / top_k
    recall_pq += len(set(indices_pq[i].tolist()) & truth) / top_k

recall_flat /= n_queries
recall_pq /= n_queries

Out[24]:

Console

Recall@10 with nprobe=10:
  IVF-Flat: 0.854
  IVF-PQ:   0.400
  Recall drop from PQ compression: 0.454

The results show that IVF-PQ achieves strong recall (typically 90-95% of IVF-Flat's performance) while compressing vectors by 32x. This modest accuracy drop is well worth the dramatic memory savings and enables billion-scale deployments where storing full-precision vectors would be prohibitively expensive. The compression makes IVF-PQ the practical choice for very large collections. To quantify the scale: one billion vectors compressed to 8 bytes each requires only 8 GB of memory, versus 256 GB for uncompressed storage, making the difference between feasible and infeasible deployments.

Out[26]:

Visualization

Accuracy-compression tradeoff comparing IVF-Flat and IVF-PQ across nprobe settings. IVF-Flat (blue circles, 256 bytes per vector) serves as the uncompressed baseline, while IVF-PQ (orange squares, 8 bytes per vector) achieves 32x compression via Product Quantization. Despite this dramatic compression, IVF-PQ maintains 90 to 95% of IVF-Flat's recall across all nprobe values. Quantization error introduces only modest accuracy loss. At nprobe=20, IVF-PQ achieves 95% recall while using just 3% of the memory. The curves' near-parallel behavior reveals that compression preserves IVF's fundamental accuracy-speed tradeoff, enabling billion-scale search with limited memory.

The training cost increases with the number of vectors and the number of clusters. For a dataset of 10,000 vectors and 100 clusters, k-means typically converges within a few seconds on a modern laptop. For datasets with millions of vectors or hundreds of thousands of clusters, training can take minutes to hours. To speed up index construction, FAISS provides options like multi-threading and sampling-based k-means variants.

IVF vs HNSWLink Copied

Both IVF and HNSW (covered in the previous chapter) are workhorses of approximate nearest neighbor search, but they make fundamentally different tradeoffs. Understanding when to use each one is critical for building effective RAG systems.

Structural DifferencesLink Copied

HNSW builds a multi-layer proximity graph where each vector is a node connected to its approximate nearest neighbors. Search proceeds by greedy traversal from the top layer down. The graph structure must be stored alongside the vectors, adding memory overhead.

IVF partitions the space using clustering and stores vectors in inverted lists. Search identifies relevant clusters and scans their contents exhaustively. The overhead is just the centroids, which is minimal.

Memory CharacteristicsLink Copied

HNSW requires storing the graph edges. With a typical parameter $M = 16$ connections per node, each vector requires additional memory for graph storage. We calculate this overhead to understand the concrete cost.

Graph storage per vector:

Each node in the HNSW graph stores $M$ bidirectional connections, where each connection is represented by a 4-byte integer (the node ID). The bidirectionality means that if node A has an edge to node B, then node B also stores an edge back to node A. Both endpoints must record the connection so that the graph can be traversed efficiently in either direction during search:

\begin{aligned} \text{Bytes per vector} &= M \times 2 \times 4 && \text{(bidirectional links as 4-byte integers)} \\ &= 16 \times 2 \times 4 && \text{(substitute } M = 16\text{)} \\ &= 128 \text{ bytes} && \text{(total graph overhead per vector)} \end{aligned}

The factor of 2 accounts for bidirectionality because in HNSW's implementation, each edge is stored at both endpoints to enable fast traversal in either direction. The factor of 4 comes from using 32-bit integers to store node IDs, which supports datasets of up to approximately 2 billion vectors.

where:

$M = 16$ : the maximum number of connections per node in the HNSW graph
Factor of 2: accounts for bidirectional links, each edge stored twice (once at each endpoint)
Factor of 4: bytes per integer, storing node IDs as 32-bit integers

This overhead is independent of vector dimensionality. Whether vectors are 64-D or 768-D, the graph structure costs the same per node. This is an important observation because it means the relative overhead of the graph decreases as vector dimensionality increases.

Overhead percentage for typical embeddings:

To understand whether 128 bytes per vector is significant, we need to compare it against the size of the vectors themselves. For 768-dimensional vectors (common in models like BERT) using 32-bit floats:

\begin{aligned} \text{Vector size} &= 768 \times 4 && \text{(dimensions times bytes per float32)} \\ &= 3{,}072 \text{ bytes} && \text{(total size per vector)} \\ \text{Overhead} &= \frac{128}{3{,}072} && \text{(graph bytes divided by vector bytes)} \\ &\approx 4\% && \text{(percentage overhead)} \end{aligned}

where:

$768$ : the dimensionality of the embedding vectors
$4$ : bytes per float32 value
$128$ : the graph storage overhead per vector, calculated above
$3{,}072$ : the total size of each 768-dimensional float32 vector in bytes

For higher-dimensional vectors, this percentage overhead decreases (graph cost stays constant while vector cost grows), making HNSW more attractive for very high-dimensional spaces. Conversely, for lower-dimensional vectors (say, 64 dimensions requiring only 256 bytes), the graph overhead of 128 bytes represents a 50% increase in memory, which is much more significant.

This overhead is modest, but the key constraint is that the entire graph structure must reside in memory for efficient traversal.

IVF stores centroids ( $k \times d$ floats) and vector assignments, which is negligible overhead. More importantly, IVF naturally supports on-disk storage. The inverted lists can be memory-mapped, and only the lists being probed need to be loaded into memory. This makes IVF far more suitable for datasets that don't fit in RAM.

Speed CharacteristicsLink Copied

HNSW typically offers better query latency for the same recall level, especially at high recall targets (>95%). The graph traversal naturally focuses on the relevant region of the space and adapts to the local density of the data. HNSW query time is also more predictable, as it doesn't depend on cell sizes.

IVF query time depends heavily on the uniformity of the clustering. If some cells are much larger than others (due to uneven data distributions), probing those cells is disproportionately expensive. However, IVF's simple structure makes it easier to parallelize. You can scan multiple inverted lists concurrently.

Index ConstructionLink Copied

IVF requires a training phase (running k-means) before vectors can be added, but this training is relatively fast and adding new vectors is trivial (just assign them to the nearest centroid). Removing vectors is also straightforward.

HNSW builds incrementally. Each vector is inserted by finding its neighbors and adding edges. This can be slow for large datasets and the insertion order can affect quality. Deletion is also more complex because removing a node requires reconnecting its neighbors.

Practical ComparisonLink Copied

The following table summarizes the key tradeoffs:

Comparison of IVF and HNSW index characteristics.

Characteristic	IVF	HNSW
Memory overhead	Low (centroids only)	Medium (graph edges)
Disk-friendly	Yes (memory-mapped lists)	No (requires RAM)
Query latency	Good	Better
Recall at low latency	Good	Better
Build time	Fast (k-means + assignment)	Slower (incremental insertion)
Update support	Easy add and remove	Easy add, hard remove
Scalability	Excellent (with PQ)	Good (limited by RAM)
Tuning complexity	Medium ( $k$ , nprobe)	Medium ( $M$ , ef)

Out[27]:

Visualization

Memory footprint per vector across index types and embedding dimensions. The grouped bar chart shows IVF-Flat (blue) stores full-precision vectors scaling linearly at 4 bytes per dimension, HNSW (orange) adds 128-byte graph overhead (M=16 connections) representing 50% overhead at d=64 but only 4% at d=768, and IVF-PQ (green) achieves constant 8-byte footprint regardless of dimensionality. IVF-PQ compression ranges from 32x at d=64 to 512x at d=1024 compared to uncompressed storage. HNSW's overhead diminishes proportionally at higher dimensions, while IVF-PQ remains constant. This difference explains why IVF-PQ alone enables billion-scale deployments where memory constraints make full-precision indexes infeasible.

When to Use WhichLink Copied

Choose IVF when:

Your dataset is too large to fit entirely in RAM
You need to combine with Product Quantization for compression
You need easy updates (adding/removing vectors frequently)
You're operating at billion-scale

Choose HNSW when:

Your dataset fits in memory
You need the highest possible recall at low latency
You can tolerate the memory overhead of the graph
You have fewer than ~10 million vectors

In many production systems, the two approaches are combined. FAISS supports an IndexHNSW as the coarse quantizer in an IVF index, replacing the flat centroid comparison with an HNSW graph over the centroids. This provides fast centroid lookup when $k$ is very large.

Limitations and ImpactLink Copied

IVF's greatest strength is its simplicity, which is also the source of its main limitation. The quality of the partitioning depends entirely on how well k-means captures the structure of the data. K-means assumes roughly spherical, equally sized clusters. This assumption is rarely satisfied in real embedding spaces. High-dimensional embedding vectors often lie on complex manifolds where Euclidean clustering creates suboptimal boundaries. Vectors near the edges of their assigned cells are the most likely to be missed during search, and in high dimensions, a growing fraction of vectors sit near cell boundaries. This "boundary effect" means that IVF often needs higher $\texttt{nprobe}$ values in high-dimensional spaces than you might expect from the ratio $\texttt{nprobe}/k$ alone. One mitigation is to use more centroids, increasing the granularity of the partitioning and requiring correspondingly more probes. However, this increases both the centroid comparison cost and the training time.

Another practical limitation is that IVF requires a training step that sees a representative sample of the data before vectors can be indexed. This means you need a substantial number of vectors available upfront. If your data distribution shifts over time (for example, as new documents are added in a RAG system), the centroids may become stale and degrade recall. Periodic retraining is necessary in dynamic settings, which adds operational complexity compared to HNSW's incremental insertion model.

Despite these limitations, IVF remains one of the most widely deployed vector index structures. Its low memory overhead, natural compatibility with compression techniques like Product Quantization, and support for disk-based storage make it practical for large-scale retrieval. Nearly every major vector database, including Milvus, Pinecone, Weaviate, and Qdrant, offers IVF-based indexes. In the context of RAG systems, where you might need to search over millions of document chunks, IVF provides the scalability that makes dense retrieval feasible at production scale.

SummaryLink Copied

The IVF index partitions the vector space using k-means clustering, organizing vectors into cells defined by their nearest centroid. At query time, only the $\texttt{nprobe}$ closest cells are searched, dramatically reducing the number of distance computations from $n$ to roughly $\texttt{nprobe} \times n/k$ . The key takeaways from this chapter are:

Clustering as partitioning: K-means divides the space into Voronoi cells, and each cell maintains an inverted list of its member vectors. The number of centroids $k$ (typically between $4\sqrt{n}$ and $16\sqrt{n}$ , where $n$ is the number of vectors) controls the granularity of the partitioning.
nprobe controls the tradeoff: Searching more cells improves recall but increases latency linearly. The sweet spot depends on your data distribution and accuracy requirements.
IVF-PQ for scale: Combining IVF with Product Quantization compresses vectors from full precision to compact codes (often 8 to 32 bytes per vector) and enables billion-scale search with limited memory.
IVF vs HNSW: IVF excels in large-scale, memory-constrained, and disk-based scenarios. HNSW provides better latency-recall tradeoffs when the dataset fits in memory. Many production systems use elements of both.

In the next chapter, we'll explore Product Quantization, understanding how it compresses high-dimensional vectors into compact codes while maintaining sufficient distance information for effective approximate search.

Key ParametersLink Copied

The key parameters for IVF are:

k (number of centroids): Controls the granularity of space partitioning. Typical values range from $4\sqrt{n}$ to $16\sqrt{n}$ where $n$ is the dataset size. More centroids create smaller cells but increase centroid comparison cost.
nprobe: Number of cells to search at query time. Controls the accuracy-speed tradeoff. Higher values improve recall but increase query latency linearly.
n_init: Number of k-means initializations during training. Higher values improve clustering quality but increase training time.
max_iter: Maximum iterations for k-means convergence. Typical values are 20-300 depending on dataset size and convergence requirements.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about IVF indexing for vector search.

Loading component...

Comments

Back to Language AI Handbook

Previous Chapter

HNSW Index: Architecture for Fast Vector Search

Next Chapter

Product Quantization

Coming Soon

Reference

BIBTEXAcademic

@misc{ivfindexclusteringbasedvectorsearchpartitioning, author = {Michael Brenndoerfer}, title = {IVF Index: Clustering-Based Vector Search & Partitioning}, year = {2026}, url = {https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }

APAAcademic

Michael Brenndoerfer (2026). IVF Index: Clustering-Based Vector Search & Partitioning. Retrieved from https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning

MLAAcademic

Michael Brenndoerfer. "IVF Index: Clustering-Based Vector Search & Partitioning." 2026. Web. today. <https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning>.

CHICAGOAcademic

Michael Brenndoerfer. "IVF Index: Clustering-Based Vector Search & Partitioning." Accessed today. https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning.

HARVARDAcademic

Michael Brenndoerfer (2026) 'IVF Index: Clustering-Based Vector Search & Partitioning'. Available at: https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning (Accessed: today).

SimpleBasic

Michael Brenndoerfer (2026). IVF Index: Clustering-Based Vector Search & Partitioning. https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning

Direct link:

https://mbrenndoerfer.com/writing/ivf-index-clustering-vector-search-partitioning

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

IVF Index: Clustering-Based Vector Search & Partitioning

IVF IndexLink Copied

The Clustering ApproachLink Copied

Index ConstructionLink Copied

Query ProcessingLink Copied

Probe Count: The Accuracy-Speed TradeoffLink Copied

Analyzing the Computational CostLink Copied

Worked ExampleLink Copied

Code ImplementationLink Copied

Building IVF from ScratchLink Copied

Measuring Recall Across Many QueriesLink Copied

Using FAISS for IVFLink Copied

Visualizing the Voronoi PartitioningLink Copied

Choosing the Number of CentroidsLink Copied

IVF-PQ: Combining Clustering with CompressionLink Copied

IVF vs HNSWLink Copied

Structural DifferencesLink Copied

Memory CharacteristicsLink Copied

Speed CharacteristicsLink Copied

Index ConstructionLink Copied

Practical ComparisonLink Copied

When to Use WhichLink Copied

Limitations and ImpactLink Copied

SummaryLink Copied

Key ParametersLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

HNSW Index: Architecture for Fast Vector Search

Vector Similarity Search: Metrics & Approximate Methods

Embedding Models: Architecture, Pooling & Selection

Stay updated

Comments

About the author: Michael Brenndoerfer

Related Content

HNSW Index: Architecture for Fast Vector Search

Vector Similarity Search: Metrics & Approximate Methods

Embedding Models: Architecture, Pooling & Selection

Stay updated