
UMAP: Complete Guide to Uniform Manifold Approximation and Projection for Dimensionality Reduction

Michael Brenndoerfer · November 2, 2025 · 26 min read · 6,016 words

A comprehensive guide covering UMAP dimensionality reduction, including mathematical foundations, fuzzy simplicial sets, manifold learning, and practical implementation. Learn how to preserve both local and global structure in high-dimensional data visualization.


UMAP

Uniform Manifold Approximation and Projection (UMAP) is a powerful dimensionality reduction technique that preserves both local and global structure in high-dimensional data. Unlike traditional methods like PCA that focus primarily on global variance, UMAP constructs a representation of the data that maintains the topological structure of the original manifold. This means that points that are close together in the high-dimensional space will remain close in the lower-dimensional projection, while also preserving the overall shape and relationships of the data.

UMAP is particularly valuable because it can handle non-linear relationships that linear methods like PCA cannot capture. While PCA assumes that the data lies on a linear subspace, UMAP makes no such assumption and can discover complex, curved manifolds in the data. This makes it especially useful for visualizing high-dimensional datasets where the underlying structure is non-linear, such as gene expression data, image embeddings, or text representations.

The method works by first constructing a fuzzy simplicial set representation of the data in high dimensions, then finding a low-dimensional representation that has a similar fuzzy simplicial set structure. This approach allows UMAP to capture both local neighborhoods (preserving local structure) and global relationships (preserving global structure) simultaneously, making it particularly effective for both visualization and preprocessing tasks.

Advantages

UMAP offers several key advantages over other dimensionality reduction techniques. First, it preserves both local and global structure, meaning that nearby points in the original space remain close in the reduced space, while the overall shape and relationships of the data are maintained. This dual preservation makes UMAP particularly effective for visualization tasks where we want to understand both fine-grained clusters and broad patterns in the data.

Second, UMAP is computationally efficient and scales well to large datasets. Unlike some other non-linear methods that have quadratic or worse time complexity, UMAP can handle datasets with millions of points in reasonable time. The algorithm uses efficient nearest neighbor search and optimization techniques that make it practical for real-world applications.

Third, UMAP is highly parameterizable, allowing users to tune the balance between local and global structure preservation. By adjusting parameters like n_neighbors and min_dist, we can control whether the algorithm focuses more on preserving local neighborhoods or global structure, making it adaptable to different types of data and analysis goals.

Disadvantages

Despite its strengths, UMAP has several limitations that users should be aware of. The most significant disadvantage is that UMAP is sensitive to its hyperparameters, particularly n_neighbors and min_dist. Choosing inappropriate values can lead to poor results, and the optimal parameters often depend on the specific dataset and analysis goals. This parameter sensitivity can make UMAP less reliable than more robust methods like PCA for certain applications.

Another limitation is that UMAP, like other non-linear dimensionality reduction methods, can be sensitive to noise and outliers in the data. Outliers can significantly affect the construction of the fuzzy simplicial set, potentially distorting the final embedding. This sensitivity to noise means that data preprocessing and outlier detection become more important when using UMAP.

Finally, while UMAP is generally faster than some alternatives like t-SNE, it can still be computationally expensive for very large datasets or when many different parameter combinations need to be tested. The nearest neighbor search, while efficient, still represents a significant computational bottleneck, and the optimization process can be time-consuming for high-dimensional data.

Formula

To understand how UMAP achieves its goal of preserving both local and global structure, we need to explore the mathematical machinery that powers it. The algorithm builds on two key ideas from topology: fuzzy simplicial sets (a way to represent how points relate to each other with partial membership) and Riemannian geometry (which lets us think about distances on curved surfaces). Let's walk through the construction step by step, focusing on the intuition behind each piece.

The Core Challenge

Before diving into formulas, consider the problem UMAP solves. We have high-dimensional data and want to create a low-dimensional representation that preserves important relationships. Linear methods like PCA assume the data lies on a flat hyperplane, but real data often lives on curved manifolds. Imagine a Swiss roll: the data lies on a 2D surface embedded in 3D space. If we simply project it onto a plane, we lose the intrinsic structure. UMAP addresses this by modeling the data as lying on a manifold and preserving the manifold's topology.

The algorithm proceeds in two phases: first, build a rich representation of the high-dimensional structure; second, find a low-dimensional arrangement that matches this structure as closely as possible.

Phase 1: Representing High-Dimensional Structure

Step 1: Computing Local Distances

We start with the basics. For each point $x_i$ in our dataset, we measure how far it is from other points using Euclidean distance:

$$d_{ij} = \|x_i - x_j\|_2$$

where $d_{ij}$ is the distance between points $x_i$ and $x_j$, and $\|\cdot\|_2$ denotes the Euclidean (L2) norm.

This gives us a distance matrix. UMAP focuses on local neighborhoods because nearby points provide more reliable information about the manifold structure than distant ones.

For each point $x_i$, we identify its $k$ nearest neighbors. These neighbors define a local patch of the manifold around $x_i$. The parameter $k$ (called n_neighbors in practice) controls how many neighbors we consider. A smaller $k$ focuses on very local structure, while a larger $k$ captures more global patterns.

Step 2: Building Fuzzy Membership Strengths

Instead of using a binary neighborhood (where points are either neighbors or not), UMAP assigns a smooth membership strength to each potential neighbor. This membership strength tells us how strongly point $x_j$ belongs to point $x_i$'s local neighborhood.

The membership strength is computed as:

$$\mu_{ij} = \exp\left(-\frac{\max(0, d_{ij} - \rho_i)}{\sigma_i}\right)$$

where:

  • $\mu_{ij}$: Membership strength indicating how strongly point $x_j$ belongs to the neighborhood of point $x_i$
  • $\rho_i$: Distance from $x_i$ to its nearest neighbor (acts as a baseline)
  • $\sigma_i$: Scaling factor specific to point $x_i$ that controls decay rate
  • $d_{ij}$: Euclidean distance between points $x_i$ and $x_j$

Any point at or within the nearest-neighbor distance (when $d_{ij} \leq \rho_i$) gets the maximum membership strength of 1.0, since $\max(0, d_{ij} - \rho_i) = 0$ and $\exp(0) = 1$. Beyond the nearest neighbor, the membership strength decays exponentially at a rate controlled by $\sigma_i$.

Why use this exponential decay? Because it creates a smooth, continuous transition from "definitely a neighbor" to "definitely not a neighbor." This smoothness is important for the optimization that comes later.
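To make the formula concrete, here is a minimal NumPy sketch that computes membership strengths for a single point from a handful of made-up neighbor distances. The value of $\sigma_i$ is a placeholder here, since UMAP chooses it by the binary search described next:

import numpy as np

# Made-up distances from one point x_i to its 5 nearest neighbors
distances = np.array([0.4, 0.7, 1.1, 1.8, 3.0])

rho_i = distances.min()  # distance to the nearest neighbor
sigma_i = 0.8            # placeholder scale; UMAP finds this by binary search

# Fuzzy membership strength of each neighbor in x_i's local neighborhood
mu = np.exp(-np.maximum(0.0, distances - rho_i) / sigma_i)
print(mu)  # the nearest neighbor gets 1.0, strengths decay smoothly after that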

Finding the Right Scale with $\sigma_i$

The parameter $\sigma_i$ is not chosen arbitrarily. For each point $x_i$, we pick $\sigma_i$ so that the effective number of neighbors equals $k$. Mathematically, we want:

$$\sum_{j \neq i} \mu_{ij} = \log_2(k)$$

where:

  • $\sum_{j \neq i} \mu_{ij}$: Sum of all membership strengths for point $x_i$ (excluding itself)
  • $\log_2(k)$: Target sum based on the number of neighbors $k$

Why $\log_2(k)$ instead of just $k$? This comes from information theory and ensures that the fuzzy simplicial set has the right "size" in terms of entropy. We find $\sigma_i$ using binary search: we try different values until the sum of membership strengths equals $\log_2(k)$.

This adaptive scaling is crucial. In dense regions of the data, $\sigma_i$ will be small (neighbors are close, so we don't need to scale distances much). In sparse regions, $\sigma_i$ will be large (neighbors are farther away, so we scale up to maintain connectivity).
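The sketch below shows one way to implement that binary search for a single point. It mirrors the idea described above rather than umap-learn's exact routine, and the neighbor distances are illustrative:

import numpy as np

def find_sigma(neighbor_distances, k, n_iter=64, tol=1e-5):
    """Binary-search the sigma that makes membership strengths sum to log2(k).

    neighbor_distances: distances from one point to its k nearest neighbors,
    sorted ascending, with the point itself excluded.
    """
    rho = neighbor_distances[0]
    target = np.log2(k)
    low, high = 1e-6, 1e3
    sigma = 1.0
    for _ in range(n_iter):
        sigma = (low + high) / 2
        total = np.sum(np.exp(-np.maximum(0.0, neighbor_distances - rho) / sigma))
        if abs(total - target) < tol:
            break
        if total < target:
            low = sigma   # memberships too weak: widen the kernel
        else:
            high = sigma  # memberships too strong: tighten the kernel
    return sigma

print(find_sigma(np.array([0.4, 0.7, 1.1, 1.8, 3.0]), k=5))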

Step 3: Creating a Global Graph

Each point $x_i$ now has its own local view of the neighborhood structure, encoded by the membership strengths $\mu_{ij}$. But these local views are asymmetric: point $x_i$ might consider $x_j$ a strong neighbor while $x_j$ considers $x_i$ a weak neighbor (or not at all).

To create a consistent global structure, we symmetrize these relationships using a fuzzy set union:

$$\mu_{ij}^{\text{global}} = \mu_{ij} + \mu_{ji} - \mu_{ij} \cdot \mu_{ji}$$

where:

  • $\mu_{ij}^{\text{global}}$: Symmetrized global membership strength between points $i$ and $j$
  • $\mu_{ij}$: Local membership strength from point $i$'s perspective
  • $\mu_{ji}$: Local membership strength from point $j$'s perspective

This formula is the probabilistic OR operation for fuzzy sets. It is equivalent to:

$$\mu_{ij}^{\text{global}} = 1 - (1 - \mu_{ij})(1 - \mu_{ji})$$

In practice, UMAP often uses a simpler approximation:

$$\mu_{ij}^{\text{global}} \approx \max(\mu_{ij}, \mu_{ji})$$

This approximation says: if either point considers the other a neighbor, preserve that connection in the global graph. The result is a weighted graph where edge weights represent the strength of relationships in the original high-dimensional space.
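As a quick sketch, both the probabilistic OR and the max approximation are one-line operations on the directed membership matrix; the matrix below is a small, hypothetical example:

import numpy as np

# Hypothetical asymmetric memberships: mu[i, j] is how strongly x_j belongs
# to x_i's neighborhood (zero where j is not among i's nearest neighbors)
mu = np.array([
    [0.0, 1.0, 0.3],
    [0.6, 0.0, 0.0],
    [0.0, 0.8, 0.0],
])

mu_union = mu + mu.T - mu * mu.T   # probabilistic OR (fuzzy set union)
mu_max = np.maximum(mu, mu.T)      # simpler max-based approximation

print(mu_union)
print(mu_max)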

Phase 2: Finding the Low-Dimensional Embedding

Step 4: Defining Structure in Low Dimensions

Now we have a target: the fuzzy simplicial set $\mu_{ij}^{\text{global}}$ that describes relationships in high dimensions. We need to find low-dimensional coordinates $y_i$ that recreate these relationships as faithfully as possible.

In the low-dimensional space, we compute similar membership strengths. But instead of using exponential decay, UMAP uses a different functional form:

$$v_{ij} = \frac{1}{1 + a \cdot \|y_i - y_j\|_2^{2b}}$$

where:

  • $v_{ij}$: Membership strength between points $i$ and $j$ in the low-dimensional embedding
  • $y_i, y_j$: Low-dimensional coordinates of points $i$ and $j$
  • $a, b$: Shape parameters that control the distribution (set based on min_dist)

This function is a generalized version of the Cauchy distribution (which t-SNE uses with $a = b = 1$).

Why a different function in low dimensions? The exponential decay in high dimensions reflects local connectivity on the manifold. In low dimensions, we need a function that allows both tight clusters (when points should be close) and clear separation (when points should be far apart). The form $\frac{1}{1 + a \cdot d^{2b}}$ achieves this balance, where $d = \|y_i - y_j\|_2$ is the Euclidean distance in the low-dimensional space.

The Role of min_dist

The min_dist parameter specifies the minimum distance between points in the low-dimensional embedding. Smaller values allow points to pack tightly (good for revealing fine structure), while larger values spread points out (good for seeing overall organization). The parameters $a$ and $b$ are set to match this desired minimum distance.
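The sketch below shows one way to recover $a$ and $b$ from min_dist by fitting the low-dimensional membership curve to a target that is flat up to min_dist and decays exponentially beyond it. This mirrors the idea behind umap-learn's parameter fitting, though the library's exact routine may differ:

import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit a and b so that 1 / (1 + a * d**(2b)) approximates a curve that is
    roughly 1 for d <= min_dist and decays exponentially beyond it."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    membership = lambda d, a, b: 1.0 / (1.0 + a * d ** (2.0 * b))
    (a, b), _ = curve_fit(membership, d, target, p0=(1.0, 1.0))
    return a, b

a, b = fit_ab(min_dist=0.1)
print(a, b)  # comes out near a ~ 1.6, b ~ 0.9 for the default min_dist of 0.1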

Step 5: Optimization via Cross-Entropy

We now have two sets of membership strengths: $\mu_{ij}^{\text{global}}$ (the target from high dimensions) and $v_{ij}$ (the current state in low dimensions). We want to adjust the low-dimensional coordinates $y_i$ to make $v_{ij}$ match $\mu_{ij}^{\text{global}}$ as closely as possible.

UMAP uses cross-entropy as the objective function:

$$C = \sum_{i,j} \mu_{ij}^{\text{global}} \log\left(\frac{\mu_{ij}^{\text{global}}}{v_{ij}}\right) + (1 - \mu_{ij}^{\text{global}}) \log\left(\frac{1 - \mu_{ij}^{\text{global}}}{1 - v_{ij}}\right)$$

where:

  • $C$: Cross-entropy loss to be minimized
  • $\mu_{ij}^{\text{global}}$: Target membership strength from high-dimensional space
  • $v_{ij}$: Current membership strength in low-dimensional space

Cross-entropy measures the difference between two probability distributions. Here we treat $\mu_{ij}^{\text{global}}$ and $v_{ij}$ as probabilities that points $i$ and $j$ are connected.

The first term, $\mu_{ij}^{\text{global}} \log(\mu_{ij}^{\text{global}} / v_{ij})$, penalizes cases where points should be close (high $\mu_{ij}^{\text{global}}$) but are far apart (low $v_{ij}$). The second term, $(1 - \mu_{ij}^{\text{global}}) \log((1 - \mu_{ij}^{\text{global}}) / (1 - v_{ij}))$, penalizes cases where points should be far apart but are close together.

Minimizing this cross-entropy pulls the low-dimensional structure toward the high-dimensional structure. We use stochastic gradient descent (SGD) to optimize the coordinates $y_i$. At each iteration, we sample pairs of points, compute how the cross-entropy would change if we moved them slightly, and update coordinates to reduce the error.
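A tiny numeric sketch of the loss, using made-up membership values for three pairs of points, shows how an embedding that matches the high-dimensional structure scores lower than one that inverts it:

import numpy as np

def umap_cross_entropy(mu, v, eps=1e-12):
    """Cross-entropy between target memberships mu and embedding memberships v."""
    mu = np.clip(mu, eps, 1 - eps)
    v = np.clip(v, eps, 1 - eps)
    return np.sum(mu * np.log(mu / v) + (1 - mu) * np.log((1 - mu) / (1 - v)))

mu = np.array([0.9, 0.8, 0.1])         # target memberships from high dimensions
v_good = np.array([0.85, 0.75, 0.15])  # embedding that roughly matches the target
v_bad = np.array([0.10, 0.20, 0.90])   # embedding that inverts the structure

print(umap_cross_entropy(mu, v_good))  # small loss
print(umap_cross_entropy(mu, v_bad))   # much larger loss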

Why This Construction Works

Let's step back and see the full picture:

  1. Local structure preservation: By using nearest neighbors and adaptive scaling ($\sigma_i$), UMAP ensures that local neighborhoods are well-represented.

  2. Global structure preservation: By connecting local views into a global fuzzy simplicial set, UMAP maintains large-scale relationships.

  3. Flexibility: The low-dimensional membership function and min_dist parameter let us control the final embedding's appearance.

  4. Mathematical foundation: The use of fuzzy sets and cross-entropy provides a principled objective function that can be efficiently optimized.

Mathematical Properties

The construction above gives UMAP several important properties:

  • Topological preservation: By building a fuzzy simplicial set, UMAP preserves topological features like holes, loops, and connected components. If two clusters are separate in high dimensions, they remain separate in low dimensions.

  • Metric structure preservation: UMAP adapts to the local density of the data. In dense regions, it preserves fine details. In sparse regions, it maintains connectivity without distorting distances too much.

  • Parameter interpretability: Each parameter has a clear geometric meaning. The n_neighbors parameter controls the size of local neighborhoods (larger values capture more global structure). The min_dist parameter controls how tightly points can cluster in the embedding (smaller values reveal finer details).

  • Computational efficiency: By focusing on local neighborhoods and using efficient optimization, UMAP scales to large datasets. The nearest neighbor search is the main computational bottleneck, but this can be accelerated with approximate methods.

Visualizing UMAP

Let's create visualizations that demonstrate how UMAP works and how it compares to other dimensionality reduction methods. We'll start with a simple example using synthetic data to illustrate the key concepts.

In[2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll, make_circles, make_moons
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP
import seaborn as sns
import warnings

# Suppress UMAP warnings about n_jobs with random_state
warnings.filterwarnings("ignore", message="n_jobs value .* overridden")

# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic datasets
swiss_roll, swiss_roll_color = make_swiss_roll(
    n_samples=1000, noise=0.1, random_state=42
)
circles, circles_color = make_circles(
    n_samples=500, noise=0.1, factor=0.3, random_state=42
)
moons, moons_color = make_moons(n_samples=500, noise=0.1, random_state=42)

# Apply different dimensionality reduction methods
pca_swiss = PCA(n_components=2, random_state=42).fit_transform(swiss_roll)
tsne_swiss = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(
    swiss_roll
)
umap_swiss = UMAP(
    n_components=2, random_state=42, n_neighbors=15, min_dist=0.1
).fit_transform(swiss_roll)

pca_circles = PCA(n_components=2, random_state=42).fit_transform(circles)
tsne_circles = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(
    circles
)
umap_circles = UMAP(
    n_components=2, random_state=42, n_neighbors=15, min_dist=0.1
).fit_transform(circles)

pca_moons = PCA(n_components=2, random_state=42).fit_transform(moons)
tsne_moons = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(moons)
umap_moons = UMAP(
    n_components=2, random_state=42, n_neighbors=15, min_dist=0.1
).fit_transform(moons)

Now let's create visualizations comparing the different methods:

Out[3]:
Visualization
Notebook output

Let's also demonstrate how UMAP parameters affect the results:

Out[4]:
Visualization
Notebook output

Example

Let's work through a concrete example using a simple 2D dataset to demonstrate how UMAP works step by step. We'll use a dataset with two distinct clusters to make the process clear.

In[5]:
# Create a simple 2D dataset with two clusters
np.random.seed(42)
cluster1 = np.random.normal([2, 2], 0.5, (50, 2))
cluster2 = np.random.normal([-2, -2], 0.5, (50, 2))
X = np.vstack([cluster1, cluster2])
y = np.hstack([np.zeros(50), np.ones(50)])

print(f"Dataset shape: {X.shape}")
print(f"Cluster 1 center: {cluster1.mean(axis=0)}")
print(f"Cluster 2 center: {cluster2.mean(axis=0)}")
print(
    f"Distance between cluster centers: {np.linalg.norm(cluster1.mean(axis=0) - cluster2.mean(axis=0)):.2f}"
)
Out[6]:
Visualization
Notebook output

Now let's apply UMAP and examine the results:

In[7]:
# Apply UMAP
reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X)

print(f"UMAP embedding shape: {X_umap.shape}")
print(
    f"UMAP embedding range: X1=[{X_umap[:, 0].min():.2f}, {X_umap[:, 0].max():.2f}], X2=[{X_umap[:, 1].min():.2f}, {X_umap[:, 1].max():.2f}]"
)
Out[8]:
Visualization
Notebook output

Let's also examine the local neighborhood structure that UMAP preserves:

In[9]:
# Examine local neighborhoods
from sklearn.neighbors import NearestNeighbors

# Find nearest neighbors in original space
nbrs_original = NearestNeighbors(n_neighbors=5).fit(X)
distances_original, indices_original = nbrs_original.kneighbors(X)

# Find nearest neighbors in UMAP space
nbrs_umap = NearestNeighbors(n_neighbors=5).fit(X_umap)
distances_umap, indices_umap = nbrs_umap.kneighbors(X_umap)

# Check how well neighborhoods are preserved
neighborhood_preservation = []
for i in range(len(X)):
    original_neighbors = set(indices_original[i])
    umap_neighbors = set(indices_umap[i])
    intersection = len(original_neighbors.intersection(umap_neighbors))
    preservation = intersection / 5.0  # 5 neighbors
    neighborhood_preservation.append(preservation)

print(f"Average neighborhood preservation: {np.mean(neighborhood_preservation):.3f}")
print(
    f"Neighborhood preservation range: [{np.min(neighborhood_preservation):.3f}, {np.max(neighborhood_preservation):.3f}]"
)

Implementation in Scikit-learn

UMAP is implemented in the umap-learn library, which provides a scikit-learn compatible interface. Let's demonstrate how to use it effectively with a real-world example.

Loading and Preparing Data

We'll use the digits dataset, which contains 8×8 pixel images of handwritten digits (0-9). This is a good example because it has high-dimensional data (64 features) that benefits from dimensionality reduction.

In[10]:
from umap import UMAP
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load the digits dataset
digits = load_digits()
X, y = digits.data, digits.target
Out[11]:
Digits dataset shape: (1797, 64)
Number of classes: 10

The dataset contains 1,797 samples with 64 features each (representing the 8×8 pixel values). We have 10 classes corresponding to digits 0 through 9.

Preprocessing

UMAP is sensitive to feature scales, so we standardize the features before applying dimensionality reduction.

In[12]:
# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
Out[13]:
Training set shape: (1437, 64)
Test set shape: (360, 64)

We've split the data into 80% training (1,437 samples) and 20% test (360 samples) sets. Standardization ensures all features have zero mean and unit variance.

Applying UMAP

Now we apply UMAP to reduce the 64-dimensional data to 2 dimensions for visualization. We set n_neighbors=15 to balance local and global structure, and min_dist=0.1 to allow moderate clustering.

In[14]:
# Apply UMAP for dimensionality reduction
reducer = UMAP(
    n_components=2,  # Reduce to 2D for visualization
    n_neighbors=15,  # Number of neighbors to consider
    min_dist=0.1,  # Minimum distance between points
    metric="euclidean",  # Distance metric
    random_state=42,
)

# Fit UMAP on training data
X_train_umap = reducer.fit_transform(X_train)
X_test_umap = reducer.transform(X_test)
Out[15]:
UMAP training embedding shape: (1437, 2)
UMAP test embedding shape: (360, 2)

The embedding successfully reduces the dimensionality from 64 to 2 for both training and test sets. Note that we use fit_transform on training data to learn the embedding, then transform on test data to project it into the same embedding space.

Visualizing the Embedding

Let's visualize the 2D UMAP embedding to see how well it separates the different digit classes.

Out[16]:
Visualization
Notebook output

The visualization shows that UMAP successfully groups similar digits together in the 2D space. We can see distinct clusters for each digit class, with some expected overlaps (e.g., between digits that look similar like 3 and 8, or 4 and 9). The test data embedding matches the training data structure, indicating that the learned manifold generalizes well.

Using UMAP for Classification

To demonstrate UMAP's value as a preprocessing step, we'll train a classifier on the reduced 2D representation and compare it to the original 64-dimensional data.

In[17]:
# Train a classifier on the UMAP embedding
rf_umap = RandomForestClassifier(n_estimators=100, random_state=42)
rf_umap.fit(X_train_umap, y_train)

# Make predictions
y_pred_umap = rf_umap.predict(X_test_umap)
accuracy_umap = accuracy_score(y_test, y_pred_umap)

# For comparison, train on original data
rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
rf_original.fit(X_train, y_train)
y_pred_original = rf_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)
Out[18]:
Classification Performance:

Accuracy on UMAP embedding (2D): 0.947
Accuracy on original data (64D): 0.972

Dimensionality reduction: 64 → 2 features (96.9% reduction)

The classifier achieves competitive accuracy even with 97% fewer features. While the original 64-dimensional data provides slightly better accuracy, the 2D UMAP embedding retains most of the discriminative information while being much faster to train and easier to visualize.

Out[19]:

Detailed Classification Report (UMAP embedding):

              precision    recall  f1-score   support

           0       0.97      1.00      0.99        33
           1       0.90      1.00      0.95        28
           2       1.00      0.91      0.95        33
           3       0.92      1.00      0.96        34
           4       1.00      1.00      1.00        46
           5       0.95      0.89      0.92        47
           6       0.97      0.97      0.97        35
           7       1.00      0.94      0.97        34
           8       0.82      0.90      0.86        30
           9       0.92      0.88      0.90        40

    accuracy                           0.95       360
   macro avg       0.95      0.95      0.95       360
weighted avg       0.95      0.95      0.95       360

The classification report shows per-class performance. Most digits achieve high precision and recall, with the model performing well across all classes. Any confusion between classes reflects the inherent similarity between certain digit shapes in the 2D projection.

Alternative Implementation

We can also implement a simplified version of UMAP from scratch to better understand the algorithm:

In[20]:
def simple_umap(X, n_components=2, n_neighbors=15, min_dist=0.1, random_state=None):
    """
    Simplified UMAP implementation for educational purposes.
    This is a basic version that demonstrates the core concepts.
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_features = X.shape

    # Step 1: Find nearest neighbors
    from sklearn.neighbors import NearestNeighbors

    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    distances, indices = nbrs.kneighbors(X)

    # Step 2: Compute local distances and sigmas
    rhos = distances[:, 1]  # Distance to nearest neighbor (excluding self)
    sigmas = np.zeros(n_samples)

    for i in range(n_samples):
        # Binary search for sigma
        low, high = 0, 10
        for _ in range(20):  # Max iterations
            mid = (low + high) / 2
            membership_strengths = np.exp(-(distances[i, 1:] - rhos[i]) / mid)
            if np.sum(membership_strengths) < np.log2(n_neighbors):
                low = mid
            else:
                high = mid
        sigmas[i] = (low + high) / 2

    # Step 3: Compute fuzzy simplicial set
    fuzzy_simplicial_set = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(1, n_neighbors):  # Skip self
            neighbor_idx = indices[i, j]
            if distances[i, j] - rhos[i] <= 0:
                fuzzy_simplicial_set[i, neighbor_idx] = 1.0
            else:
                fuzzy_simplicial_set[i, neighbor_idx] = np.exp(
                    -(distances[i, j] - rhos[i]) / sigmas[i]
                )

    # Step 4: Symmetrize the fuzzy simplicial set
    fuzzy_simplicial_set = np.maximum(fuzzy_simplicial_set, fuzzy_simplicial_set.T)

    # Step 5: Initialize low-dimensional embedding
    Y = np.random.normal(0, 0.1, (n_samples, n_components))

    # Step 6: Optimize embedding (simplified gradient descent)
    learning_rate = 0.01
    n_epochs = 200

    for epoch in range(n_epochs):
        # Compute pairwise distances in low-dimensional space
        Y_distances = np.sqrt(
            np.sum((Y[:, np.newaxis, :] - Y[np.newaxis, :, :]) ** 2, axis=2)
        )

        # Compute membership strengths in low-dimensional space
        a, b = 1.0, 1.0  # Simplified parameters
        low_dim_membership = 1 / (1 + a * Y_distances ** (2 * b))

        # Compute gradient (simplified)
        gradient = np.zeros_like(Y)
        for i in range(n_samples):
            for j in range(n_samples):
                if i != j:
                    diff = Y[i] - Y[j]
                    dist = Y_distances[i, j]
                    if dist > 0:
                        weight = fuzzy_simplicial_set[i, j] - low_dim_membership[i, j]
                        gradient[i] += weight * diff / dist

        # Update embedding
        Y -= learning_rate * gradient

    return Y


# Test the simplified implementation
X_simple = np.random.randn(100, 10)
Y_simple = simple_umap(X_simple, n_components=2, n_neighbors=15, random_state=42)

print(f"Simplified UMAP embedding shape: {Y_simple.shape}")
Out[21]:
Visualization
Notebook output

Key Parameters

Below are the main parameters that control how UMAP works and performs. Understanding these parameters helps you tune UMAP for your specific use case.

n_components: Number of dimensions in the low-dimensional embedding (default: 2). Common values are 2 for visualization or 10-50 for preprocessing before other algorithms. Higher values preserve more information but lose visualization benefits.

n_neighbors: Number of nearest neighbors to consider when building the high-dimensional graph (default: 15). This parameter controls the balance between local and global structure:

  • Smaller values (5-10): Focus on local structure, revealing fine details and small clusters
  • Medium values (15-30): Balance local and global structure, suitable for most applications
  • Larger values (50-100): Emphasize global structure, showing broad patterns

min_dist: Minimum distance between points in the low-dimensional embedding (default: 0.1). Controls how tightly points pack together:

  • Small values (0.0-0.1): Allow tight clustering, good for finding detailed structure
  • Medium values (0.1-0.5): Balance clustering and separation, suitable for most visualizations
  • Large values (0.5-0.99): Spread points out more evenly, emphasizing broad organization

metric: Distance metric for computing distances in the high-dimensional space (default: 'euclidean'). Options include 'euclidean', 'manhattan', 'cosine', 'correlation', and many others. Choose based on your data type:

  • Euclidean: General-purpose, works well for continuous features
  • Cosine: Good for text data or when magnitude doesn't matter
  • Manhattan: Robust to outliers in high dimensions

random_state: Seed for reproducibility (default: None). Set to an integer to ensure consistent results across runs.

n_epochs: Number of optimization iterations (default: auto-determined based on dataset size). More epochs improve quality but increase computation time. The default usually works well.

learning_rate: Step size for gradient descent optimization (default: 1.0). Rarely needs adjustment, but lower values (0.1-0.5) can help if the embedding quality is poor.

low_memory: Whether to use a lower-memory but slower implementation (default: False). Set to True for very large datasets that don't fit in memory.
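As a quick sketch (reusing the scaled digits data X_scaled from the earlier example), sweeping n_neighbors and min_dist is often the fastest way to see how these parameters interact:

from umap import UMAP

# Sweep the two most influential parameters; each combination gives a
# different view of the same data (X_scaled from the digits example above)
embeddings = {}
for n_neighbors in (5, 15, 50):
    for min_dist in (0.0, 0.1, 0.5):
        reducer = UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            random_state=42,
        )
        embeddings[(n_neighbors, min_dist)] = reducer.fit_transform(X_scaled)

print(f"Computed {len(embeddings)} embeddings")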

Key Methods

The following are the most commonly used methods for interacting with UMAP.

fit(X, y=None): Learns the UMAP embedding from training data X. The optional y parameter can provide supervised information to guide the embedding.

fit_transform(X, y=None): Fits the model and returns the transformed data in one step. Use this for the initial embedding of your training data.

transform(X): Projects new data into the existing embedding space. Use this for test data or new observations after fitting on training data. Note that transform may be less accurate than the original embedding.

inverse_transform(X_embedded): Attempts to reconstruct the original high-dimensional data from the low-dimensional embedding. This is approximate and works better for higher-dimensional embeddings.
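A short sketch of these methods in sequence, continuing the digits example (X_train and X_test from above); note that inverse_transform returns only an approximate reconstruction:

# Typical method usage, continuing the digits example
reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)

X_train_umap = reducer.fit_transform(X_train)  # learn the embedding
X_test_umap = reducer.transform(X_test)        # project new points into it

# Approximate reconstruction back into the original 64-dimensional space
X_test_reconstructed = reducer.inverse_transform(X_test_umap)
print(X_test_reconstructed.shape)  # (360, 64)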

Practical Applications

When to Use UMAP

UMAP is particularly valuable when working with high-dimensional data that contains non-linear structure. In bioinformatics and genomics, UMAP excels at visualizing gene expression data from single-cell RNA sequencing experiments. These datasets typically contain thousands of genes (features) measured across thousands of cells, with complex relationships that form curved manifolds in high-dimensional space. UMAP's ability to preserve both local neighborhoods (grouping similar cells) and global structure (maintaining relationships between cell types) makes it effective for identifying cell populations, developmental trajectories, and disease states that might be missed by linear methods.

For machine learning practitioners, UMAP serves as a preprocessing step before classification or clustering when dealing with high-dimensional embeddings. In computer vision, image representations from deep neural networks often capture semantic relationships that lie on non-linear manifolds. UMAP can reduce these embeddings from hundreds or thousands of dimensions to a manageable size while preserving the neighborhood structure that indicates visual similarity. Similarly, in natural language processing, word embeddings and document representations benefit from UMAP's ability to maintain both local semantic relationships and broader topical structure.

UMAP is also useful in exploratory data analysis when you need to understand the structure of your data before modeling. The 2D or 3D visualizations it produces can reveal clusters, outliers, and relationships that inform feature engineering, model selection, and data quality assessment. However, UMAP is less suitable when your data truly lies on a linear subspace, when you need deterministic results without parameter tuning, or when computational resources are severely limited. In these cases, PCA may be more appropriate despite its linear assumptions.

Best Practices

To achieve optimal results with UMAP, start with the default parameters (n_neighbors=15, min_dist=0.1) and adjust based on your specific needs. For visualization tasks, keep n_components=2 or n_components=3 to enable plotting. For preprocessing before other algorithms, use higher dimensions (10-50) to retain more information while still reducing computational cost. When your dataset is small (fewer than 1,000 samples), reduce n_neighbors to 5-10 to avoid forcing the algorithm to consider too many neighbors relative to the dataset size. For large datasets (over 10,000 samples), you can increase n_neighbors to 30-50 to capture more global structure.

The min_dist parameter controls the visual appearance of your embedding. Smaller values (0.0-0.05) create tight, distinct clusters that emphasize fine-grained structure, which works well when you expect many small groups. Larger values (0.3-0.5) spread points out more evenly, making it easier to see the overall organization at the cost of losing some local detail. For exploratory analysis, try multiple min_dist values to understand your data from different perspectives. Set random_state to an integer for reproducibility, as UMAP uses stochastic optimization that produces different results across runs.

Evaluate your UMAP embeddings using multiple approaches rather than relying on a single metric. Quantitatively, measure how well the embedding preserves local neighborhoods by comparing nearest neighbors in the original space versus the embedded space. Visually, check whether the embedding reveals meaningful structure by coloring points according to known labels or metadata. If you see unexpected patterns or poor separation, try adjusting n_neighbors or min_dist, or verify that your preprocessing was appropriate. Remember that UMAP makes tradeoffs to compress high-dimensional data into low dimensions, so perfect preservation is impossible.

Data Requirements and Preprocessing

UMAP requires numerical feature data and is sensitive to feature scales because it computes distances in the original high-dimensional space. Before applying UMAP, standardize your features using z-score normalization (zero mean, unit variance) to ensure that all features contribute equally to distance calculations. Features measured in different units or with vastly different ranges will otherwise dominate the embedding, leading to distorted results. For example, if one feature ranges from 0-1000 while another ranges from 0-1, the larger-scale feature will disproportionately influence neighborhood calculations.

When working with very high-dimensional data (thousands of features), consider applying PCA as a preprocessing step to reduce dimensionality to a few hundred components before using UMAP. This approach reduces computational cost, removes noise, and often improves UMAP's ability to find meaningful structure. However, avoid reducing to very few PCA components (fewer than 10-20), as this may remove non-linear relationships that UMAP could otherwise discover. The combination of PCA for noise reduction and linear structure removal, followed by UMAP for non-linear manifold learning, often produces better results than either method alone.
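One way to wire this up is a scikit-learn pipeline. The sketch below assumes a hypothetical high-dimensional array X_high, and the component counts are illustrative:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from umap import UMAP

# Hypothetical high-dimensional data (1,000 samples x 2,000 features)
X_high = np.random.randn(1000, 2000)

# Standardize, denoise with PCA, then learn non-linear structure with UMAP
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=50, random_state=42),
    UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42),
)
X_embedded = pipeline.fit_transform(X_high)
print(X_embedded.shape)  # (1000, 2)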

Handle missing values before applying UMAP, as the algorithm requires complete data for distance calculations. Either impute missing values using appropriate methods for your domain, or remove samples with missing data if your dataset is large enough. Similarly, identify and handle outliers carefully. Extreme outliers can distort distance calculations and affect the fuzzy simplicial set construction, potentially pulling the embedding in misleading directions. Consider whether outliers represent meaningful rare cases or data quality issues, and treat them accordingly.

Common Pitfalls

One frequent mistake is using UMAP without standardizing features, which leads to embeddings dominated by high-variance or large-scale features. This is particularly problematic when features have different units (e.g., mixing counts, proportions, and measurements in different scales). The resulting visualization may appear to show clear structure, but the patterns reflect scale differences rather than meaningful relationships. Check your feature scales before running UMAP, and standardize unless you have a specific reason not to (such as when all features are already comparable, like pixel values from images).

Another common issue is over-interpreting distances in UMAP embeddings. While UMAP preserves local neighborhoods well, global distances can be distorted, and the distance between clusters in the 2D embedding does not necessarily reflect their true separation in the original space. Two clusters that appear far apart in the UMAP visualization might actually be close in the high-dimensional space, or vice versa. Use UMAP primarily for understanding local structure and overall topology, not for precise distance measurements. If global distances matter for your analysis, consider alternative methods or validate UMAP results with distance calculations in the original space.

Failing to experiment with different parameter values is another pitfall. Many users stick with default parameters even when they produce suboptimal results for their specific data. If your embedding looks overly compressed or overly dispersed, lacks clear structure when you expect to see it, or shows unexpected patterns, try adjusting n_neighbors and min_dist. Start by varying n_neighbors across a range (5, 15, 30, 50) to see how the balance between local and global structure changes. Similarly, try several min_dist values (0.0, 0.1, 0.3) to find the level of compactness that best reveals your data's structure. Document which parameters work best for your use case to guide future analyses.

Computational Considerations

UMAP's computational complexity depends primarily on the nearest neighbor search and the optimization process. For datasets with tens of thousands of points, UMAP typically completes in minutes on modern hardware. The nearest neighbor search scales approximately as O(n log n) when using efficient tree-based or approximate methods, while the optimization phase scales roughly linearly with the number of points and edges in the graph. Memory usage is dominated by storing the nearest neighbor graph, which requires O(n × n_neighbors) space.

For large datasets (over 100,000 points), consider strategies to reduce computational cost. Reduce n_neighbors from the default 15 to 10 or even 5, which decreases both the graph construction time and memory requirements. Rely on approximate nearest neighbor search; most modern UMAP implementations use it automatically, which keeps graph construction tractable at this scale. For extremely large datasets (millions of points), consider subsampling to a representative subset, applying UMAP to learn the embedding, and then projecting the remaining points using the transform method, as sketched below. This approach trades some accuracy for substantial speed improvements.
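A sketch of that subsample-then-project strategy, assuming a hypothetical large array X_large with illustrative sizes:

import numpy as np
from umap import UMAP

rng = np.random.default_rng(42)
X_large = rng.normal(size=(200_000, 50))  # hypothetical large dataset

# Fit UMAP on a representative subsample...
subsample_idx = rng.choice(len(X_large), size=20_000, replace=False)
reducer = UMAP(n_components=2, n_neighbors=10, min_dist=0.1)
X_sub_embedded = reducer.fit_transform(X_large[subsample_idx])

# ...then project the remaining points into the learned embedding
rest_idx = np.setdiff1d(np.arange(len(X_large)), subsample_idx)
X_rest_embedded = reducer.transform(X_large[rest_idx])
print(X_sub_embedded.shape, X_rest_embedded.shape)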

UMAP benefits from parallelization during the nearest neighbor search phase when multiple CPU cores are available. Most implementations automatically use multiple cores for this step. However, setting random_state disables some parallelization to ensure reproducibility, which can slow down computation. For production use cases where speed matters more than exact reproducibility, consider omitting random_state or running multiple trials in parallel with different random seeds to verify stability. For deployment scenarios requiring fast inference on new data points, precompute the UMAP embedding on a reference dataset and use the transform method for new points, though be aware that transformed points may have lower quality embeddings than the original fitted data.

Performance and Deployment Considerations

Evaluating UMAP embeddings requires multiple complementary approaches because there is no single metric that captures all aspects of quality. Quantitatively, measure neighborhood preservation by computing the overlap between k-nearest neighbors in the original space and the embedded space. A preservation rate above 0.7 indicates that the embedding maintains most local structure, while rates below 0.5 suggest that significant information has been lost. You can also measure how well the embedding preserves pairwise distances by comparing distance rankings before and after dimensionality reduction using Spearman correlation.
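Both checks can be scripted directly. The sketch below assumes the digits embedding from earlier (X_train and X_train_umap) and uses scikit-learn and SciPy for the neighbor search and rank correlation:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

def neighborhood_preservation(X_original, X_embedded, k=15):
    """Average fraction of each point's k nearest neighbors kept after embedding."""
    idx_orig = NearestNeighbors(n_neighbors=k + 1).fit(X_original).kneighbors(X_original)[1]
    idx_emb = NearestNeighbors(n_neighbors=k + 1).fit(X_embedded).kneighbors(X_embedded)[1]
    overlaps = [
        len(set(idx_orig[i, 1:]) & set(idx_emb[i, 1:])) / k
        for i in range(len(X_original))
    ]
    return float(np.mean(overlaps))

def distance_rank_correlation(X_original, X_embedded):
    """Spearman correlation between pairwise distances before and after embedding."""
    return spearmanr(pdist(X_original), pdist(X_embedded)).correlation

print(neighborhood_preservation(X_train, X_train_umap, k=15))
print(distance_rank_correlation(X_train, X_train_umap))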

Visual inspection remains important despite quantitative metrics. Color points in the embedding according to known labels, metadata, or features to verify that meaningful patterns emerge. Clusters of the same color indicate that UMAP successfully grouped similar items. Check for gradients or continuous transitions in the embedding when you expect them (e.g., developmental trajectories in biological data). Be cautious about finding spurious patterns—UMAP can create apparent structure even in random data, so validate findings against domain knowledge or external data.

When deploying UMAP in production systems, consider the computational cost of fitting versus transforming. Fitting UMAP requires significant computation and should be done on a representative training set. Once fitted, the model can transform new points using the transform method, which is faster but produces lower-quality embeddings than the original fit_transform. For applications requiring high-quality embeddings of new data, periodically refit UMAP on an updated reference dataset that includes recent samples. Monitor the neighborhood preservation metric over time to detect when the embedding quality degrades, which may indicate data drift or the need to update the reference embedding. Store UMAP models along with the preprocessing pipeline (standardization parameters, PCA components if used) to ensure consistent transformations of new data.

Summary

UMAP advances dimensionality reduction by combining local structure preservation with global structure preservation. The method's foundation in fuzzy simplicial sets and Riemannian geometry provides a mathematically rigorous approach to manifold learning that scales well to large datasets while maintaining interpretable parameters.

The algorithm's strength lies in revealing non-linear relationships that linear methods like PCA cannot capture. This capability makes UMAP valuable for visualization and preprocessing in fields where data structure is inherently non-linear, such as genomics, computer vision, and natural language processing. The balance between local and global preservation can be tuned through the n_neighbors and min_dist parameters to match specific analysis goals.

The main tradeoff involves parameter sensitivity. While PCA requires little more than choosing the number of components, UMAP requires thoughtful selection of n_neighbors (which controls the locality of the embedding) and min_dist (which controls point spacing). Feature standardization is recommended when scales differ across dimensions. Despite these requirements, UMAP provides a flexible approach to understanding high-dimensional data that complements traditional linear methods.

