Dense Retrieval: Semantic Search & Bi-Encoder Implementation

Michael Brenndoerfer · January 22, 2026 · 35 min read

Master dense retrieval for semantic search. Explore bi-encoder architectures, embedding metrics, and contrastive learning to overcome keyword limitations.


Dense Retrieval

In the previous chapter, we examined the RAG architecture and how it combines retrieval with generation. At the heart of this system lies a critical question: how do we find the most relevant documents for a given query? Traditional search engines rely on lexical matching, counting how often query terms appear in documents. But what if the document uses different words to express the same concept? What if "automobile" appears in the document but you search for "car"?

Dense retrieval addresses this fundamental limitation by representing both queries and documents as continuous vectors in a shared semantic space. Rather than matching exact words, dense retrieval measures the similarity between the meanings of queries and documents. A query about "climate change impacts" can match documents discussing "global warming effects" because both map to nearby points in the embedding space, even though they share no common terms.

This shift from discrete term matching to continuous semantic similarity is a significant advance in information retrieval. Building on the transformer architectures and embedding techniques we've explored throughout this book, dense retrieval enables retrieval systems that understand language rather than merely counting words.

From Lexical to Semantic Matching

Recall from Part II that BM25 retrieves documents by computing term-frequency statistics: documents score higher when they contain more query terms, with diminishing returns for repeated terms and penalties for common words. This approach works remarkably well for many queries, but it fails when queries and relevant documents use different vocabulary.

Consider searching a medical knowledge base with the query "heart attack symptoms." A relevant document might discuss "myocardial infarction warning signs" without ever using the words "heart" or "attack." BM25 would score this document at zero because it shares no terms with the query. Yet a physician would recognize these as describing the same condition.

This vocabulary mismatch problem becomes acute in several scenarios:

  • Synonyms and paraphrases: "car" vs "automobile," "purchase" vs "buy"
  • Technical vs casual language: "myocardial infarction" vs "heart attack"
  • Abbreviations and expansions: "ML" vs "machine learning"
  • Conceptual similarity: "renewable energy policy" might relate to documents about "solar panel subsidies"

Dense retrieval sidesteps vocabulary mismatch entirely. Instead of asking "which words match?", it asks "which meanings match?" By encoding both queries and documents into a continuous vector space where semantically similar texts cluster together, dense retrieval can find relevant documents regardless of the specific words they use.

The Bi-Encoder Architecture

The dominant architecture for dense retrieval is the bi-encoder, which uses two separate encoder networks to produce embeddings for queries and documents independently. This separation is crucial for efficiency at scale, and understanding why requires us to think carefully about what happens during retrieval.

Bi-Encoder

A neural architecture that encodes queries and documents using separate (but often identical) transformer encoders, producing fixed-dimensional vectors that can be compared using simple similarity metrics.

Architecture Overview

The bi-encoder consists of two components that work in tandem to transform text into comparable numerical representations. The first component is the query encoder, denoted $E_q$, which takes a natural language query $q$ as input and produces a dense vector representation. The second component is the document encoder, denoted $E_d$, which performs the analogous transformation for documents.

Formally, we can express these transformations as follows:

  • Query encoder $E_q$: Maps a query $q$ to a dense vector $\mathbf{q} = E_q(q) \in \mathbb{R}^D$
  • Document encoder $E_d$: Maps a document $d$ to a dense vector $\mathbf{d} = E_d(d) \in \mathbb{R}^D$

Both encoders are typically initialized from the same pre-trained transformer (such as BERT, as we covered in Part XVII), and they may share weights or be fine-tuned independently. The encoders produce embeddings of the same dimensionality $D$, enabling direct comparison. This shared dimensionality is essential because it allows us to measure distances and angles between query and document vectors in the same geometric space.

A key challenge in building these encoders is converting the variable-length output of a transformer into a single fixed-size vector suitable for comparison. Recall that when a BERT model processes an input sequence, it produces a hidden state vector for each token in the sequence. To condense this sequence of token representations into a single fixed-size vector, BERT-based encoders typically extract the representation of the special [CLS] token from the final layer:

$$\mathbf{q} = \text{BERT}(q)_{[\text{CLS}]}$$

where:

  • $\mathbf{q}$ is the dense vector representation of the query
  • $\text{BERT}(q)$ is the sequence of hidden states from the last layer of the BERT model
  • $[\text{CLS}]$ indicates extraction of the vector corresponding to the special classification token (designed to aggregate sequence-level information)

Alternatively, to capture information distributed across the entire sequence rather than relying on a single token, some models compute the average of all token representations. This approach, known as mean pooling, treats each token's contribution equally:

$$\mathbf{q} = \frac{1}{n} \sum_{i=1}^{n} \text{BERT}(q)_i$$

where:

  • $\mathbf{q}$ is the dense vector representation of the query
  • $n$ is the number of tokens in the query
  • $\text{BERT}(q)_i$ is the vector representation of the $i$-th token
  • $\sum$ computes the element-wise sum of all token vectors, which is then divided by $n$ to find the geometric center

The intuition behind mean pooling is that important semantic information may be spread across multiple tokens rather than concentrated in the [CLS] token. By averaging, we create a representation that balances contributions from all parts of the input. The choice between [CLS] pooling and mean pooling affects retrieval quality, and different models adopt different strategies based on their training objectives. Empirically, models trained with mean pooling as part of their objective tend to perform better when evaluated using mean pooling, and likewise for [CLS] pooling.
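To make the two pooling strategies concrete, here is a minimal sketch of [CLS] and mean pooling over a transformer's token outputs, assuming a bert-base-uncased checkpoint loaded through the Hugging Face transformers library. Masking out padding tokens before averaging is an implementation detail added here, not something the text above prescribes.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative model choice; any BERT-style encoder works the same way
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts, pooling="mean"):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

    if pooling == "cls":
        # [CLS] pooling: take the first token's representation
        return hidden[:, 0]

    # Mean pooling: average token vectors, ignoring padding positions
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

query_vec = encode(["How do plants make food?"], pooling="mean")
print(query_vec.shape)  # torch.Size([1, 768])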

Why Separate Encoders?

The bi-encoder's separation of query and document encoding enables a critical optimization: pre-computation of document embeddings. To appreciate why this matters, consider the computational demands of a retrieval system. In a retrieval system with millions of documents, we can encode all documents offline, storing their embeddings in a vector index. At query time, we only need to encode the query once, then compare it against the pre-computed document embeddings.

This contrasts with cross-encoders, which concatenate the query and document and process them jointly through a single transformer. Cross-encoders can capture fine-grained interactions between query and document tokens, often achieving higher accuracy, but they require running the transformer once for every query-document pair. For a corpus of 10 million documents, this means 10 million transformer forward passes per query, which is computationally infeasible.

The bi-encoder architecture trades some modeling power for massive efficiency gains:

Comparison of Bi-Encoder and Cross-Encoder architectures.

| Aspect | Bi-Encoder | Cross-Encoder |
| --- | --- | --- |
| Query-time computation | Encode query once | Encode query-doc pairs |
| Document pre-computation | Yes (offline) | No |
| Query-document interaction | None (independent encoding) | Full attention |
| Typical use | First-stage retrieval | Reranking |
| Scalability | Millions of documents | Hundreds of candidates |

We'll explore reranking with cross-encoders in a later chapter, where they serve as a second-stage refinement over bi-encoder candidates.

Embedding Similarity Metrics

Once we have query and document embeddings, we need a similarity function to rank documents. The three most common metrics are dot product, cosine similarity, and Euclidean distance. Each of these metrics captures a different notion of what it means for two vectors to be "similar," and understanding their geometric interpretations helps us choose the right metric for a given application.

Dot Product

The dot product quantifies the similarity between two vectors by aggregating their aligned components. To understand this intuitively, imagine two vectors as arrows pointing in some direction in high-dimensional space. The dot product measures how much one vector "projects" onto another, combining information about both their directions and their lengths. For vectors $\mathbf{q}$ and $\mathbf{d}$, the dot product is defined as:

$$\text{sim}(\mathbf{q}, \mathbf{d}) = \mathbf{q} \cdot \mathbf{d} = \sum_{i=1}^{D} q_i d_i$$

where:

  • $\text{sim}(\mathbf{q}, \mathbf{d})$ is the similarity score
  • $\mathbf{q}, \mathbf{d}$ are the dense vectors for the query and document
  • $D$ is the dimensionality of the embedding space (e.g., 768)
  • $q_i, d_i$ are the $i$-th scalar components
  • $\sum$ aggregates alignment across all dimensions

The geometric interpretation of this formula is instructive. Each dimension of the embedding space captures some aspect of meaning. When both $q_i$ and $d_i$ are large and positive, their product contributes positively to the similarity, indicating that both the query and document exhibit that particular semantic feature strongly. Conversely, when one is positive and the other negative, the contribution is negative, reducing overall similarity.

The dot product is computationally efficient and captures both the alignment of vectors (their angular similarity) and their magnitudes. Larger embeddings produce larger dot products, which can be useful when magnitude carries semantic meaning. For example, longer, more detailed documents might have larger embedding norms, and this property could be desirable if we want to favor comprehensive documents.

Many dense retrieval models, including DPR (Dense Passage Retrieval), use dot product similarity because it can be computed extremely efficiently using matrix multiplication. Given a query vector and a matrix of document vectors, we can compute all similarity scores in a single operation, leveraging highly optimized linear algebra libraries.
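To illustrate this efficiency, the sketch below scores a query against an entire matrix of pre-computed document embeddings with a single matrix-vector product. The corpus size and the random vectors are made up for illustration; in a real system the embeddings would come from the encoders.

import numpy as np

rng = np.random.default_rng(0)
D = 768                                    # embedding dimensionality
doc_matrix = rng.normal(size=(10_000, D))  # pre-computed document embeddings
query_vec = rng.normal(size=(D,))          # encoded query

# One matrix-vector product yields all 10,000 dot-product scores at once
scores = doc_matrix @ query_vec            # shape: (10_000,)
top5 = np.argsort(scores)[::-1][:5]        # indices of the five highest-scoring docs
print(top5, scores[top5])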

Cosine Similarity

While the dot product captures both direction and magnitude, there are situations where we want to focus purely on semantic direction, ignoring how "long" the vectors are. Cosine similarity addresses this need by normalizing the dot product by the magnitudes of both vectors, measuring only the angular alignment:

$$\begin{aligned} \cos(\mathbf{q}, \mathbf{d}) &= \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \|\mathbf{d}\|} \\ &= \frac{\sum_{i=1}^{D} q_i d_i}{\sqrt{\sum_{i=1}^{D} q_i^2} \sqrt{\sum_{i=1}^{D} d_i^2}} \end{aligned}$$

where:

  • $\cos(\mathbf{q}, \mathbf{d})$ is the cosine similarity score
  • $\mathbf{q} \cdot \mathbf{d}$ is the dot product (unnormalized projection)
  • $\|\mathbf{q}\|, \|\mathbf{d}\|$ are the Euclidean norms (lengths) of the vectors
  • $\sqrt{\sum q_i^2}$ calculates the vector magnitude

The cosine similarity ranges from $-1$ (opposite directions) to $+1$ (same direction), with $0$ indicating orthogonality, meaning the vectors are perpendicular in the embedding space. By removing magnitude effects, cosine similarity focuses purely on semantic direction. This is particularly useful when we want to compare texts of different lengths on an equal footing, since a longer document might naturally produce a larger-magnitude embedding without being more relevant.

A useful property emerges when embeddings are L2-normalized, meaning each vector is divided by its magnitude so that $\|\mathbf{q}\| = \|\mathbf{d}\| = 1$. In this case, cosine similarity equals the dot product:

$$\cos(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \hat{\mathbf{q}} \cdot \hat{\mathbf{d}}$$

Here, $\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are the unit-length vectors derived by dividing $\mathbf{q}$ and $\mathbf{d}$ by their respective norms. The dot product operation, applied to these normalized vectors, now directly yields the cosine of the angle between the original vectors.

This equivalence is useful because we can pre-normalize all document embeddings and use the faster dot product operation while still computing cosine similarity. This is exactly what many production systems do: they normalize embeddings once during indexing, then use simple dot products during retrieval.
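A quick numerical check of this equivalence, using made-up three-dimensional vectors: normalize each vector once, and the plain dot product reproduces cosine similarity.

import numpy as np

q = np.array([0.2, 0.7, 0.1])
d = np.array([0.5, 0.4, 0.9])

cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# Pre-normalize (as done once at indexing time), then use a plain dot product
q_hat = q / np.linalg.norm(q)
d_hat = d / np.linalg.norm(d)
dot_of_normalized = q_hat @ d_hat

print(np.isclose(cosine, dot_of_normalized))  # True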

Euclidean Distance

The third common metric takes a different perspective entirely. Rather than measuring how vectors align, Euclidean (L2) distance measures the straight-line distance between vectors in the embedding space:

$$\text{dist}(\mathbf{q}, \mathbf{d}) = \|\mathbf{q} - \mathbf{d}\| = \sqrt{\sum_{i=1}^{D} (q_i - d_i)^2}$$

where:

  • $\text{dist}(\mathbf{q}, \mathbf{d})$ is the Euclidean distance
  • $\mathbf{q} - \mathbf{d}$ is the difference vector
  • $(q_i - d_i)^2$ is the squared difference in dimension $i$
  • $\sqrt{\cdot}$ converts the sum of squared differences to linear distance

Unlike the similarity metrics above, smaller distances indicate greater similarity. This is an important distinction to keep in mind when implementing retrieval systems: with Euclidean distance, we search for the nearest neighbors rather than the highest-scoring documents.

Euclidean distance is closely related to the dot product for normalized vectors: when both vectors have unit length, $\|\mathbf{q} - \mathbf{d}\|^2 = 2 - 2\,\mathbf{q} \cdot \mathbf{d}$, so minimizing Euclidean distance is mathematically equivalent to maximizing the dot product. This relationship allows certain indexing algorithms designed for one metric to be adapted for the other.
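The identity is easy to verify numerically; the vectors below are random unit vectors chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=8); q /= np.linalg.norm(q)   # unit-length query vector
d = rng.normal(size=8); d /= np.linalg.norm(d)   # unit-length document vector

squared_distance = np.sum((q - d) ** 2)
identity_rhs = 2 - 2 * (q @ d)

print(np.isclose(squared_distance, identity_rhs))  # True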

Out[2]:
Visualization
Geometric representation of query and document embeddings in vector space. The query vector (A) forms a smaller angle ($\theta_1$) with the semantically similar document (B) than with the dissimilar document (C, $\theta_2$), illustrating how angular alignment captures relevance. The larger magnitude of document C demonstrates why unnormalized dot products can be misleading.
Out[3]:
Visualization
Comparison of similarity metrics for the vectors shown in the previous figure. Cosine similarity correctly assigns a higher score to the semantically similar document (B), reflecting its angular alignment with the query. In contrast, the unnormalized dot product incorrectly favors the dissimilar document (C) due to its larger vector magnitude.

Choosing a Metric

The choice of similarity metric should match the training objective used when the embedding model was created:

  • Models trained with dot product loss perform best with dot product similarity
  • Models trained with cosine loss perform best with cosine similarity
  • Many embedding models normalize their outputs, making the choice less critical

In practice, cosine similarity (or equivalently, dot product with normalized embeddings) is the most common choice because it's robust to variations in embedding magnitude and provides easily interpretable scores. When similarity ranges from 0 to 1, it becomes straightforward to set thresholds and compare scores across different queries.

Dense vs Sparse Retrieval: A Detailed Comparison

Understanding when to use dense versus sparse retrieval requires examining their complementary strengths and weaknesses.

Strengths of Dense Retrieval

Dense retrieval excels in scenarios where semantic understanding matters more than exact term matching:

Semantic matching: Dense retrieval captures meaning beyond surface forms. The query "best laptop for programming" matches documents about "developer-friendly notebooks with good keyboards" even without shared terms.

Handling synonyms: Medical queries like "hypertension treatment" naturally match documents about "high blood pressure medication" because both concepts map to similar regions in the embedding space.

Cross-lingual retrieval: With multilingual encoders, the same query can retrieve relevant documents in different languages, as semantically equivalent text in different languages clusters together in the embedding space.

Robustness to typos: Minor spelling variations ("recieve" vs "receive") often produce similar embeddings, whereas exact-match systems would fail entirely.

Strengths of Sparse Retrieval

Sparse retrieval (BM25, TF-IDF) maintains advantages in several areas:

Exact match requirements: When you search for specific identifiers like "CVE-2024-1234" or "iPhone 15 Pro Max," exact term matching is essential. Dense models may confuse similar but distinct identifiers.

Rare terms and proper nouns: Uncommon terms carry high information value in BM25 (high IDF weights). Dense models may underweight rare terms that weren't well-represented in training data.

Interpretability: BM25 scores are explainable: "this document ranked high because it contains the query terms 'machine' (3 times) and 'learning' (5 times)." Dense similarity scores are opaque.

Zero-shot generalization: BM25 works immediately on any corpus without training. Dense retrievers need training data that matches the target domain.

Efficiency: BM25 uses inverted indices that scale efficiently to billions of documents. Dense retrieval requires vector indices that are more memory-intensive.

When Each Approach Fails

Dense retrieval struggles with:

  • Keyword-heavy queries: Searching for "Python pandas DataFrame merge" requires exact library and function names
  • Out-of-domain text: Models trained on Wikipedia may perform poorly on legal contracts
  • Negation: "hotels without pools" might match documents about "hotels with pools" because both contain similar concepts
  • Numerical reasoning: "apartments under $2000/month" requires understanding numerical constraints

Sparse retrieval struggles with:

  • Vocabulary mismatch: "affordable housing" vs "low-cost apartments"
  • Paraphrased queries: "what causes climate change" vs "global warming factors"
  • Conceptual queries: "books like Harry Potter" requires understanding genre and style
  • Short queries: Single-word queries provide little context for term weighting

The Case for Hybrid Approaches

Given these complementary strengths, many production systems combine both approaches. A typical hybrid strategy:

  1. Run both dense and sparse retrieval in parallel
  2. Normalize scores from each system
  3. Combine scores with learned or tuned weights
  4. Return the top-k documents from the merged ranking

We'll explore hybrid search techniques in detail in a later chapter on combining retrieval signals.
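As a rough sketch of steps 2 and 3 above, the snippet below min-max normalizes two score lists and blends them with a fixed weight. The scores, the 0.5 weight, and the min-max scheme are illustrative assumptions; production systems often use learned weights or alternatives such as reciprocal rank fusion.

import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1] so the two systems are comparable."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

# Hypothetical scores for the same five candidate documents
bm25_scores = [12.4, 0.0, 7.1, 3.3, 9.8]       # sparse retrieval
dense_scores = [0.21, 0.78, 0.75, 0.40, 0.35]  # cosine similarities

alpha = 0.5  # tuned or learned weight
hybrid = alpha * min_max(dense_scores) + (1 - alpha) * min_max(bm25_scores)

top_k = np.argsort(hybrid)[::-1][:3]
print(top_k, hybrid[top_k])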

Training Dense Retrievers

Creating effective dense retrievers requires specialized training to produce embeddings where similar queries and documents cluster together. This section covers the key components of the training process, from the mathematical objective that guides learning to the practical considerations of data collection and negative sampling.

The Training Objective

The goal of dense retriever training is to learn an embedding space where:

  • Relevant query-document pairs have high similarity
  • Irrelevant query-document pairs have low similarity

This is typically formulated as a contrastive learning problem. The core idea is deceptively simple: teach the model to distinguish between documents that satisfy your information need and documents that do not. Given a query $q$, a relevant (positive) document $d^+$, and irrelevant (negative) documents $\{d^-_1, d^-_2, \ldots, d^-_k\}$, we want:

$$\text{sim}(E_q(q), E_d(d^+)) > \text{sim}(E_q(q), E_d(d^-_i)) \quad \forall i$$

where:

  • $\text{sim}(\cdot)$ is the similarity function (e.g., dot product)
  • $E_q(q), E_d(\cdot)$ are the encoder embeddings
  • $d^+$ is the relevant (positive) document
  • $d^-_i$ is the $i$-th irrelevant (negative) document
  • $\forall i$ indicates the condition holds for all negative samples

The intuition behind this formulation is geometric: we want the query embedding to be closer to the positive document embedding than to any negative document embedding in the vector space.

The most common loss function is the contrastive loss (also called InfoNCE loss). This function treats retrieval as a classification task. It first computes the probability that the positive document is the correct match among the set of negatives, then minimizes the negative log of that probability. The mathematical formulation is:

$$\begin{aligned} P(d^+|q) &= \frac{\exp(\text{sim}(\mathbf{q}, \mathbf{d}^+) / \tau)}{\exp(\text{sim}(\mathbf{q}, \mathbf{d}^+) / \tau) + \sum_{i=1}^{k} \exp(\text{sim}(\mathbf{q}, \mathbf{d}^-_i) / \tau)} \\ \mathcal{L} &= -\log P(d^+|q) \end{aligned}$$

where:

  • $P(d^+|q)$ is the probability that $d^+$ is the correct match
  • $\mathcal{L}$ is the loss to minimize
  • $\mathbf{q}, \mathbf{d}^+, \mathbf{d}^-_i$ are the embeddings for query, positive, and negative docs
  • $\tau$ is the temperature parameter
  • $k$ is the number of negative samples

The temperature parameter $\tau$ controls the sharpness of the probability distribution and deserves special attention. When $\tau$ is small (close to 0), the exponential function amplifies differences between similarity scores, making the model more confident in its distinctions. When $\tau$ is large, the probability distribution becomes more uniform, making the training signal weaker. Finding the right temperature is often done through experimentation.

In the numerator, $\exp(\text{sim}(\mathbf{q}, \mathbf{d}^+) / \tau)$ computes the exponential score for the positive pair. In the denominator, $\sum_{i=1}^{k} \exp(\text{sim}(\mathbf{q}, \mathbf{d}^-_i) / \tau)$ sums the exponential scores for all $k$ negative samples. Combined with the positive pair's score, this sum forms the normalization constant that ensures probabilities sum to one.

This loss pushes the model to increase the similarity between query and positive document while decreasing similarity with negatives. We'll explore contrastive learning in much greater depth in the next chapter.
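Here is a minimal numpy sketch of this loss for a single query, assuming the similarities have already been computed; the scores are made up, and a real implementation would operate on batches and backpropagate through both encoders.

import numpy as np

def info_nce_loss(sim_pos, sim_negs, tau=0.05):
    """Contrastive (InfoNCE) loss for one query.

    sim_pos:  similarity between the query and its positive document
    sim_negs: array of similarities between the query and k negative documents
    tau:      temperature controlling the sharpness of the softmax
    """
    logits = np.concatenate(([sim_pos], sim_negs)) / tau
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # -log P(d+ | q)

# Made-up similarity scores for illustration
loss_easy = info_nce_loss(0.85, np.array([0.10, 0.05, 0.20]))
loss_hard = info_nce_loss(0.85, np.array([0.80, 0.75, 0.82]))
print(loss_easy, loss_hard)   # hard negatives produce a larger loss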

Out[4]:
Visualization
Visualization of the contrastive learning objective in embedding space. The loss function forces the query (q) and positive document (d+) closer together while pushing negative documents (d-) apart. This push-pull mechanism creates a semantic margin that effectively distinguishes relevant content from irrelevant distractors.

Training Data Sources

Dense retrievers require training data consisting of query-document pairs with relevance labels. Common sources include:

Natural Questions (NQ): Google's dataset of real questions paired with Wikipedia passages containing answers. This provides natural query distribution but limited to Wikipedia domain.

MS MARCO: Microsoft's large-scale reading comprehension dataset with web queries and relevant passages. Its scale (500k+ queries) and web domain make it popular for training general-purpose retrievers.

Synthetic data: Using LLMs to generate queries for existing documents. Given a passage, the model generates questions that the passage would answer. This allows creating training data for any domain.

Click logs: In production systems, user clicks provide implicit relevance signals. Documents that users click after issuing a query are treated as positive examples.

Negative Sampling Strategies

The choice of negative documents significantly impacts training quality. Several strategies exist:

Random negatives: Sample random documents from the corpus. Simple but often too easy: random documents are typically obviously irrelevant.

BM25 negatives: Use BM25 to find documents that match query terms but aren't relevant. These "hard negatives" share surface features with the query but differ semantically, forcing the model to learn beyond lexical overlap.

In-batch negatives: Use positive documents from other queries in the same training batch as negatives. Efficient because it requires no additional encoding, but the negatives may be too easy.

Self-mined hard negatives: Use the current model to find high-scoring documents that aren't labeled as relevant. These are the hardest negatives because the current model finds them confusable with true positives.

Research has shown that combining these strategies often works best: start with random negatives for initial training, then fine-tune with BM25 and self-mined hard negatives.
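To make the in-batch negatives idea concrete, the sketch below builds the full batch similarity matrix with one matrix product; diagonal entries correspond to the positive pairs, and every off-diagonal entry serves as a negative. Batch size, dimensionality, and the random embeddings are illustrative.

import numpy as np

rng = np.random.default_rng(2)
batch_size, dim = 4, 384
query_embs = rng.normal(size=(batch_size, dim))  # one query per row
doc_embs = rng.normal(size=(batch_size, dim))    # the positive doc for each query

# (batch_size x batch_size) similarity matrix in a single matrix product
sim_matrix = query_embs @ doc_embs.T

# For query i: sim_matrix[i, i] is the positive score, and
# sim_matrix[i, j] for j != i are the in-batch negative scores.
positives = np.diag(sim_matrix)
print(sim_matrix.shape, positives.shape)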

The Training Pipeline

A typical dense retriever training pipeline:

  1. Initialize query and document encoders from pre-trained BERT
  2. Construct batches with queries, positive documents, and sampled negatives
  3. Encode all queries and documents in the batch
  4. Compute pairwise similarities and contrastive loss
  5. Backpropagate through both encoders
  6. Repeat for multiple epochs, potentially re-mining hard negatives periodically

The training process is computationally intensive because we need to encode many negatives per query to provide a strong training signal. Techniques like in-batch negatives help by reusing computation across the batch.

Worked Example: Semantic Similarity in Action

Let's trace through a concrete example to build intuition for how dense retrieval differs from lexical matching. This example will help cement the abstract concepts we've discussed by showing them in operation on real text.

Consider a query and three candidate documents:

Query: "How do plants make food?"

Document A: "Photosynthesis is the process by which green plants and certain other organisms transform light energy into chemical energy."

Document B: "Plants require sunlight, water, and carbon dioxide to produce glucose and oxygen through a complex series of reactions."

Document C: "My grandmother makes delicious food using fresh vegetables from her garden."

From a BM25 perspective, Document C shares the terms "makes" and "food" directly with the query, giving it substantial term overlap. Documents A and B mention "plants" but otherwise express the answer in technical terminology ("photosynthesis," "glucose," "chemical energy") that never appears in the query. This illustrates the vocabulary mismatch problem: the most relevant documents describe the concept in words the natural language query doesn't use.

A dense retriever, however, produces embeddings that capture semantic meaning rather than surface-level word matches:

  • Query embedding: Encodes the concept of "plant nutrition/energy production"
  • Document A embedding: Clusters near "plant biology, photosynthesis" concepts
  • Document B embedding: Also clusters near "plant biology, photosynthesis"
  • Document C embedding: Clusters near "cooking, family, gardening" concepts

The cosine similarity between query and Document A might be 0.78, between query and Document B might be 0.75, while query and Document C might be only 0.21. Despite Document C's lexical overlap, its semantic meaning is far from the query's intent. The dense retriever recognizes that "making food" in the context of plants refers to biological energy production, not culinary preparation.

Out[5]:
Visualization
Comparison of lexical (BM25) and dense retrieval scores for the query 'How do plants make food?'. The lexical model incorrectly favors the culinary document (C) due to surface-level keyword overlap. The dense retriever correctly assigns higher scores to the scientific documents (A and B), recognizing the semantic relationship between photosynthesis and the query despite the vocabulary mismatch.

This example illustrates the fundamental difference: sparse retrieval operates on surface forms, counting which words appear and how often, while dense retrieval operates on underlying meaning, measuring conceptual similarity in a learned semantic space.

Code Implementation

Let's implement dense retrieval using the sentence-transformers library, which provides pre-trained bi-encoder models optimized for semantic similarity.

In[6]:
Code
!uv pip install sentence-transformers scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Load a pre-trained bi-encoder model
# This model was trained on large-scale retrieval datasets
model = SentenceTransformer('all-MiniLM-L6-v2')

The all-MiniLM-L6-v2 model is a compact but effective retriever, producing 384-dimensional embeddings. It was trained on over 1 billion sentence pairs using contrastive learning.

In[7]:
Code
# Sample document corpus about machine learning
documents = [
    "Neural networks learn hierarchical representations of data through multiple layers.",
    "Gradient descent optimizes model parameters by iteratively updating weights.",
    "Transformers use self-attention to capture dependencies regardless of distance.",
    "Random forests combine multiple decision trees for robust predictions.",
    "Support vector machines find optimal hyperplanes to separate classes.",
    "Convolutional neural networks excel at processing grid-structured data like images.",
    "Recurrent neural networks maintain hidden states to process sequential data.",
    "The backpropagation algorithm computes gradients through the chain rule.",
]

# Encode all documents (this would be done offline in production)
document_embeddings = model.encode(
    documents, convert_to_numpy=True, normalize_embeddings=True
)
Out[8]:
Console
Encoded 8 documents
Embedding shape: (8, 384)
Embedding dimension: 384

Each document is now represented as a 384-dimensional vector. In a production system, these embeddings would be stored in a vector index for efficient search.

Computing Similarity Scores

Let's implement a function to retrieve documents for a query:

In[9]:
Code
def dense_retrieve(query, doc_embeddings, docs, model, top_k=3):
    """
    Retrieve top-k documents using dense retrieval.

    Args:
        query: The search query string
        doc_embeddings: Pre-computed document embeddings
        docs: List of document strings
        model: The sentence transformer model
        top_k: Number of documents to retrieve

    Returns:
        List of (document, score) tuples
    """
    # Encode the query
    query_embedding = model.encode(
        query, convert_to_numpy=True, normalize_embeddings=True
    )

    # Compute cosine similarity with all documents
    # For normalized embeddings, this is equivalent to dot product
    similarities = np.dot(doc_embeddings, query_embedding)

    # Get indices of top-k highest scores
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents with scores
    return [(docs[i], similarities[i]) for i in top_indices]

Let's test our retriever with a query that demonstrates semantic matching:

In[10]:
Code
query = "How do neural networks learn?"
results = dense_retrieve(query, document_embeddings, documents, model, top_k=3)
Out[11]:
Console
Query: How do neural networks learn?

Top retrieved documents:

1. Score: 0.681
   Neural networks learn hierarchical representations of data through multiple layers.

2. Score: 0.532
   The backpropagation algorithm computes gradients through the chain rule.

3. Score: 0.437
   Recurrent neural networks maintain hidden states to process sequential data.

The retriever surfaces the document that directly describes how neural networks learn, along with documents about backpropagation (how the gradients that drive learning are computed) and recurrent networks. The backpropagation document is retrieved even though it shares no content terms with the query, because it is semantically tied to the learning process.

Comparing Dense and Lexical Retrieval

Let's compare dense retrieval to a simple lexical approach:

In[12]:
Code
def bm25_score(query, doc, avg_doc_len, k1=1.5, b=0.75):
    """Compute simplified BM25 score for a single document."""
    query_terms = query.lower().split()
    doc_terms = doc.lower().split()
    doc_len = len(doc_terms)

    # Term frequency in document
    term_freq = Counter(doc_terms)

    score = 0.0
    for term in query_terms:
        if term in term_freq:
            tf = term_freq[term]
            # Simplified BM25 (ignoring IDF for this demo)
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * (doc_len / avg_doc_len))
            score += numerator / denominator

    return score


def lexical_retrieve(query, docs, top_k=3):
    """Retrieve using simplified BM25."""
    avg_len = np.mean([len(d.split()) for d in docs])
    scores = [(doc, bm25_score(query, doc, avg_len)) for doc in docs]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

Now let's test both approaches on a query with vocabulary mismatch:

In[13]:
Code
# Query that uses different terminology
query_mismatch = "deep learning optimization methods"

dense_results = dense_retrieve(
    query_mismatch, document_embeddings, documents, model, top_k=3
)
lexical_results = lexical_retrieve(query_mismatch, documents, top_k=3)
Out[14]:
Console
Query: 'deep learning optimization methods'

Dense Retrieval Results:
  1. [0.513] Gradient descent optimizes model parameters by iteratively updating we...
  2. [0.379] Neural networks learn hierarchical representations of data through mul...
  3. [0.369] The backpropagation algorithm computes gradients through the chain rul...

Lexical (BM25) Results:
  1. [0.000] Neural networks learn hierarchical representations of data through mul...
  2. [0.000] Gradient descent optimizes model parameters by iteratively updating we...
  3. [0.000] Transformers use self-attention to capture dependencies regardless of ...

Dense retrieval finds semantically relevant documents about gradient descent and neural network learning, while simple lexical matching struggles because the query terms don't appear in the documents. This demonstrates the vocabulary mismatch problem that dense retrieval addresses.

Visualizing the Embedding Space

Let's visualize how documents cluster in the embedding space:

Out[15]:
Visualization
Principal Component Analysis (PCA) projection of document embeddings. Documents discussing related concepts (e.g., neural networks, optimization) cluster together, while the query 'How do neural networks learn?' aligns with the relevant semantic neighborhood. This visualization demonstrates how the bi-encoder maps semantically similar text to proximal points in the vector space.

The visualization shows how documents cluster by topic in the embedding space. The query "How do neural networks learn?" is positioned near documents about neural networks, gradient descent, and backpropagation, enabling semantic retrieval without term matching.

Batch Retrieval for Efficiency

In practice, we encode multiple queries at once for efficiency:

In[16]:
Code
def batch_retrieve(queries, doc_embeddings, docs, model, top_k=3):
    """
    Efficiently retrieve for multiple queries using matrix operations.
    """
    # Encode all queries at once
    query_embeddings = model.encode(
        queries, convert_to_numpy=True, normalize_embeddings=True
    )

    # Compute all similarities with single matrix multiplication
    # Shape: (num_queries, num_docs)
    all_similarities = np.dot(query_embeddings, doc_embeddings.T)

    results = []
    for i, query in enumerate(queries):
        similarities = all_similarities[i]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        query_results = [(docs[j], similarities[j]) for j in top_indices]
        results.append((query, query_results))

    return results


# Test batch retrieval
test_queries = [
    "attention mechanisms in deep learning",
    "ensemble methods for classification",
    "processing sequential information",
]

batch_results = batch_retrieve(
    test_queries, document_embeddings, documents, model, top_k=2
)
Out[17]:
Console
Batch Retrieval Results:

Query: 'attention mechanisms in deep learning'
  [0.451] Neural networks learn hierarchical representations of data t...
  [0.376] Recurrent neural networks maintain hidden states to process ...

Query: 'ensemble methods for classification'
  [0.513] Support vector machines find optimal hyperplanes to separate...
  [0.477] Random forests combine multiple decision trees for robust pr...

Query: 'processing sequential information'
  [0.601] Recurrent neural networks maintain hidden states to process ...
  [0.265] Neural networks learn hierarchical representations of data t...

The matrix multiplication computes all query-document similarities simultaneously, making batch retrieval much more efficient than processing queries one at a time.

Key Parameters

The key parameters for the Dense Retrieval implementation are:

  • model_name_or_path: Argument for SentenceTransformer to select the pre-trained weights (e.g., 'all-MiniLM-L6-v2'). Different models offer different trade-offs between speed, model size, and embedding quality.
  • normalize_embeddings: Argument for model.encode. When set to True, it produces unit-length vectors, enabling dot product to equal cosine similarity.
  • top_k: Argument for the retrieval function determining the number of results to return. In a RAG pipeline, this controls context size.

Limitations and Impact

Dense retrieval has transformed information retrieval, but understanding its limitations is essential for effective deployment.

Key Limitations

Dense retrievers require substantial training data to perform well. Unlike BM25, which works out-of-the-box on any text collection, dense models need labeled query-document pairs that match the target domain. A retriever trained on Wikipedia passages may perform poorly on legal contracts, scientific papers, or code documentation. This domain sensitivity means organizations often need to fine-tune models on their specific data, which requires expertise and labeled examples.

The computational requirements of dense retrieval are significant. Each document must be encoded by a transformer model, and the resulting embeddings must be stored and indexed. For a corpus of 100 million documents with 384-dimensional embeddings, storage alone requires approximately 150 gigabytes. Sparse indices based on inverted lists often require far less memory and can be searched faster for simple keyword queries. The next chapters on vector similarity search and indexing techniques address how to make dense retrieval practical at scale.
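The 150-gigabyte figure above follows from a quick back-of-the-envelope calculation, assuming 32-bit floats:

num_docs = 100_000_000      # 100 million documents
dim = 384                   # embedding dimensionality
bytes_per_float = 4         # float32

total_gb = num_docs * dim * bytes_per_float / 1e9
print(total_gb)             # ~153.6 GB, before any index overhead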

Dense retrievers can also fail silently in ways that sparse retrievers don't. When BM25 returns poor results, the explanation is often clear: the query terms don't appear in relevant documents. When a dense retriever fails, diagnosing the problem is harder. The model might not have learned good representations for certain concepts, might confuse similar-sounding but different entities, or might not handle negation correctly. This opacity makes debugging and improving dense retrieval systems more challenging.

Impact on NLP Systems

Despite these limitations, dense retrieval has had enormous impact. By enabling semantic matching at scale, it unlocked new capabilities across NLP:

Question answering systems improved dramatically when they could retrieve passages by meaning rather than keywords. Open-domain QA, where systems answer questions using large document collections, became practical with dense retrieval.

RAG systems, which we introduced in the previous chapters, depend critically on dense retrieval. The ability to find semantically relevant passages allows LLMs to answer questions about current events, specialized domains, and private data that wasn't in their training set.

Semantic search products from Google, Microsoft, and others now incorporate dense retrieval, improving results for conversational queries and complex information needs.

Cross-lingual applications became more practical because dense models can encode meaning across languages, enabling search systems where queries and documents may be in different languages.

The combination of transformer pre-training and contrastive learning for retrieval, which we'll explore in the next chapter, established the foundation for modern semantic search systems.

Summary

Dense retrieval represents a fundamental shift from lexical to semantic matching in information retrieval. Rather than counting term overlaps, dense retrievers encode queries and documents into continuous vector spaces where similarity reflects semantic relatedness.

The bi-encoder architecture enables scalable dense retrieval by separating query and document encoding. Documents can be pre-computed and indexed offline, requiring only a single query encoding at search time. This efficiency trade-off sacrifices some of the fine-grained interaction captured by cross-encoders but enables retrieval over millions of documents.

Embedding similarity metrics, particularly cosine similarity and dot product, quantify semantic relatedness between query and document vectors. The choice of metric should align with the model's training objective, though many models normalize embeddings, making the metrics equivalent.

Dense and sparse retrieval have complementary strengths. Dense retrieval excels at semantic matching and handles vocabulary mismatch gracefully, while sparse retrieval provides exact term matching, better handles rare terms, and offers more interpretable results. Production systems often combine both approaches in hybrid architectures.

Training dense retrievers requires query-document pairs and careful negative sampling. The contrastive learning objective pushes the model to maximize similarity between relevant pairs while minimizing similarity with hard negatives. The quality of negatives significantly impacts the learned embedding space.

In the next chapter, we'll dive deeper into contrastive learning for retrieval, examining the training objectives, loss functions, and techniques that produce effective dense retrieval models.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about dense retrieval and semantic matching.

