A comprehensive guide to multinomial logistic regression covering mathematical foundations, softmax function, coefficient estimation, and practical implementation in Python with scikit-learn.

This article is part of the free-to-read Data Science Handbook
Multinomial Logistic Regression
Multinomial logistic regression is a powerful extension of binary logistic regression that allows us to model the relationship between a categorical outcome with multiple levels and a set of predictor variables. While binary logistic regression handles situations with exactly two possible outcomes (like "yes" or "no"), multinomial logistic regression extends this capability to handle three or more mutually exclusive categories. This makes it invaluable for classification problems where we need to predict which of several possible classes an observation belongs to.
The key insight behind multinomial logistic regression is that we can model the probability of each category relative to a reference category. Unlike binary logistic regression, which uses a single logistic function, multinomial logistic regression uses multiple logistic functions simultaneously. Each function represents the log-odds of one category compared to the reference category, creating a system of equations that must be solved together.
This approach is particularly useful in fields like marketing (predicting customer segments), medicine (diagnosing disease types), and natural language processing (classifying text into different topics). The model assumes that the categories are mutually exclusive and collectively exhaustive, meaning each observation must belong to exactly one category, and all possible categories are included in the model.
Advantages
Multinomial logistic regression offers several compelling advantages over alternative classification methods. First, it provides interpretable probability estimates for each class, not just the predicted class. This probabilistic output is crucial in many applications where we need to understand the confidence of our predictions or make decisions based on uncertainty. For example, in medical diagnosis, knowing that a patient has a 60% chance of having condition A, 30% chance of condition B, and 10% chance of condition C is much more informative than simply predicting condition A.
Second, the model naturally handles the constraint that all class probabilities must sum to one, ensuring that our predictions are mathematically consistent. This built-in constraint prevents the common problem of having probabilities that don't add up correctly, which can occur with other multi-class approaches that treat each class independently. Additionally, multinomial logistic regression can handle both categorical and continuous predictor variables seamlessly, making it flexible for diverse datasets.
Finally, the model provides direct interpretability through odds ratios, allowing us to understand how changes in predictor variables affect the relative probability of different outcomes. This interpretability is particularly valuable in fields where understanding the relationship between variables is as important as making accurate predictions.
Disadvantages
Despite its strengths, multinomial logistic regression has several limitations that practitioners should consider. The most significant disadvantage is the assumption of independence of irrelevant alternatives (IIA), which states that the relative probabilities of any two alternatives should not change when a third alternative is added or removed. This assumption can be violated in many real-world scenarios, particularly when alternatives are similar or when there are unobserved factors that affect multiple alternatives similarly.
Another limitation is the model's sensitivity to the choice of reference category. While the choice of reference category doesn't affect the overall model fit, it can significantly impact the interpretation of coefficients and odds ratios. This can make model interpretation more complex, especially when communicating results to stakeholders who may not be familiar with the reference category concept.
Additionally, multinomial logistic regression can become computationally intensive as the number of categories increases, since we need to estimate parameters for each category relative to the reference. The model also assumes linear relationships between the log-odds and predictor variables, which may not hold in all situations. When these assumptions are violated, the model's performance can degrade significantly, and alternative approaches like decision trees or neural networks might be more appropriate.
Formula
The mathematical foundation of multinomial logistic regression builds upon the principles of binary logistic regression, but extends them to handle multiple categories. Let's walk through the derivation step by step, starting with the most intuitive form and progressing to the matrix notation.
Basic Probability Model
We begin with the fundamental assumption that we have $K$ categories (where $K > 2$) and want to model the probability that observation $i$ belongs to category $j$. Let's denote this probability as $\pi_{ij} = P(y_i = j \mid \mathbf{x}_i)$, where $\mathbf{x}_i$ represents the vector of predictor variables for observation $i$.
The key insight is that we can model the log-odds of each category relative to a reference category. Without loss of generality, let's choose category $K$ as our reference category. The log-odds of category $j$ relative to category $K$ is:

$$\ln\left(\frac{\pi_{ij}}{\pi_{iK}}\right) = \beta_{j0} + \sum_{p=1}^{P} \beta_{jp} x_{ip}, \qquad j = 1, \dots, K-1$$

where:
- $\pi_{ij}$ is the probability that observation $i$ belongs to category $j$ given its predictor values (the conditional probability we're modeling for the non-reference category)
- $\pi_{iK}$ is the probability that observation $i$ belongs to the reference category (the baseline category against which all others are compared)
- $\beta_{j0}$ is the intercept for category $j$ (the baseline log-odds when all predictors are zero)
- $\beta_{jp}$ is the coefficient for predictor $p$ in category $j$ (quantifies the effect of predictor $p$ on the log-odds of category $j$ relative to the reference category)
- $x_{ip}$ is the value of predictor $p$ for observation $i$ (the observed feature value)
- $P$ is the total number of predictor variables (the dimensionality of the feature space)
- $K$ is the reference category (chosen arbitrarily, typically the last category, with all its coefficients set to zero for identifiability)
From Log-Odds to Probabilities
To convert these log-odds back to probabilities, we need to solve for $\pi_{ij}$. Let's define the linear predictor for category $j$ as:

$$\eta_{ij} = \beta_{j0} + \sum_{p=1}^{P} \beta_{jp} x_{ip}$$

where:
- $\eta_{ij}$ is the linear predictor for observation $i$ and category $j$ (a real-valued quantity that can take any value from $-\infty$ to $+\infty$, representing the log-odds before normalization)

Then we can write:

$$\ln\left(\frac{\pi_{ij}}{\pi_{iK}}\right) = \eta_{ij}$$

Taking the exponential of both sides:

$$\frac{\pi_{ij}}{\pi_{iK}} = e^{\eta_{ij}}$$

This gives us:

$$\pi_{ij} = \pi_{iK} \, e^{\eta_{ij}}$$

Normalization Constraint
Since all probabilities must sum to one, we have:

$$\sum_{j=1}^{K} \pi_{ij} = 1$$

Substituting our expression for $\pi_{ij}$:

$$\pi_{iK} + \sum_{j=1}^{K-1} \pi_{iK} \, e^{\eta_{ij}} = 1$$

Since $\pi_{iK}$ is common to all terms, we can factor it out:

$$\pi_{iK} \left( 1 + \sum_{j=1}^{K-1} e^{\eta_{ij}} \right) = 1$$

This gives us the probability of the reference category:

$$\pi_{iK} = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\eta_{ij}}}$$

Final Probability Formula
Now we can express the probability of any category $j$ as:

$$\pi_{ij} = \frac{e^{\eta_{ij}}}{1 + \sum_{k=1}^{K-1} e^{\eta_{ik}}} = \frac{e^{\eta_{ij}}}{\sum_{k=1}^{K} e^{\eta_{ik}}}, \qquad \text{with } \eta_{iK} = 0$$

This is the softmax function, which ensures that all probabilities sum to one and are non-negative. Notice that when $K = 2$, this reduces to the standard logistic regression formula.
The softmax function transforms linear predictors (η values) into probabilities. As one linear predictor increases, its corresponding probability increases while others decrease proportionally. The transformation ensures all probabilities remain between 0 and 1 and sum to exactly 1.
This heatmap shows how the softmax function distributes probability mass across three classes as their linear predictors vary. When one class has a much larger η value, it captures most of the probability mass, demonstrating the competitive nature of the softmax transformation.
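To make the softmax concrete, here is a minimal NumPy sketch; the $\eta$ values are made up purely for illustration.

```python
import numpy as np

def softmax(eta):
    """Convert a vector of linear predictors into probabilities."""
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

# Hypothetical linear predictors for three classes (illustrative values only);
# the last entry is the reference class with eta = 0.
eta = np.array([2.0, 1.0, 0.0])
probs = softmax(eta)

print(probs)        # approximately [0.665, 0.245, 0.090]
print(probs.sum())  # 1.0
```

Subtracting the maximum is a standard stability trick; it leaves the probabilities unchanged because the softmax is invariant to adding the same constant to every linear predictor.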
Matrix Notation
For computational efficiency and clarity, we can express the multinomial logistic regression model in matrix notation.
Let $\mathbf{X} \in \mathbb{R}^{n \times (P+1)}$ be the design matrix, where $n$ is the number of observations and $P$ is the number of predictors (the extra column accounts for the intercept). For each non-reference category $j$ (where $j = 1, \dots, K-1$), let $\boldsymbol{\beta}_j \in \mathbb{R}^{P+1}$ be a vector of coefficients. We stack these coefficient vectors horizontally to form a coefficient matrix:

$$\mathbf{B} = \begin{bmatrix} \boldsymbol{\beta}_1 & \boldsymbol{\beta}_2 & \cdots & \boldsymbol{\beta}_{K-1} \end{bmatrix} \in \mathbb{R}^{(P+1) \times (K-1)}$$

where:
- $\mathbf{X}$ is the $n \times (P+1)$ design matrix (includes a column of 1s for the intercept term, allowing matrix multiplication to compute all linear predictors)
- $n$ is the number of observations in the dataset (the sample size)
- $P$ is the number of predictor variables (the number of features, excluding the intercept)
- $\boldsymbol{\beta}_j$ is the coefficient vector for category $j$ (includes the intercept as the first element, followed by slope coefficients)
- $\mathbf{B}$ is the $(P+1) \times (K-1)$ coefficient matrix for all non-reference categories (each column is a coefficient vector for one non-reference category)
- $K$ is the total number of categories (including the reference category)

The linear predictors for all observations and all non-reference categories can be computed as:

$$\mathbf{H} = \mathbf{X}\mathbf{B}$$

where:
- $\mathbf{H}$ is the $n \times (K-1)$ matrix of linear predictors (each row corresponds to one observation, each column corresponds to one non-reference category, and element $\eta_{ij}$ is the linear predictor for observation $i$ and category $j$)

To obtain the predicted probabilities for all categories, we apply the softmax function row-wise. For each observation $i$ and category $j$:

$$\pi_{ij} = \frac{e^{\eta_{ij}}}{\sum_{k=1}^{K} e^{\eta_{ik}}}$$

where for the reference category $K$, we set $\eta_{iK} = 0$ (since its coefficients are all zero by convention).
In matrix notation, the full probability matrix $\boldsymbol{\Pi}$, with elements $\pi_{ij}$, is given by:

$$\boldsymbol{\Pi} = \operatorname{softmax}\big(\begin{bmatrix} \mathbf{H} & \mathbf{0} \end{bmatrix}\big)$$

where:
- $\boldsymbol{\Pi}$ is the $n \times K$ probability matrix (each row contains the predicted probabilities for one observation across all categories)
- $\pi_{ij}$ is the probability that observation $i$ belongs to category $j$ (an element of the probability matrix)
- $\begin{bmatrix} \mathbf{H} & \mathbf{0} \end{bmatrix}$ is the matrix formed by horizontally concatenating $\mathbf{H}$ (the matrix of linear predictors for non-reference categories) with $\mathbf{0}$ (a column of zeros for the reference category)
- $\mathbf{0}$ is the column vector of zeros for the reference category (representing $\eta_{iK} = 0$ for all observations)
- $\operatorname{softmax}$ is the softmax function applied row-wise (transforms linear predictors into probabilities that are non-negative and sum to 1 for each observation)

This formulation ensures that all predicted probabilities are non-negative and sum to one for each observation.
This matrix-based approach is highly efficient for computation, especially when working with large datasets or when fitting the model using optimization algorithms that require gradients with respect to all parameters.
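As a rough sketch of the matrix formulation, the following NumPy snippet builds a random design matrix and coefficient matrix (purely illustrative values), computes $\mathbf{H} = \mathbf{X}\mathbf{B}$, appends the zero column for the reference category, and applies the softmax row-wise:

```python
import numpy as np

rng = np.random.default_rng(0)

n, P, K = 5, 2, 3                        # observations, predictors, categories
X = np.column_stack([np.ones(n),         # design matrix with an intercept column
                     rng.normal(size=(n, P))])
B = rng.normal(size=(P + 1, K - 1))      # one coefficient column per non-reference class

H = X @ B                                # linear predictors, shape (n, K-1)
H_full = np.column_stack([H, np.zeros(n)])   # append eta = 0 for the reference class

# Row-wise softmax
Z = np.exp(H_full - H_full.max(axis=1, keepdims=True))
Pi = Z / Z.sum(axis=1, keepdims=True)

print(Pi.shape)        # (5, 3)
print(Pi.sum(axis=1))  # each row sums to 1
```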
Calculating the Coefficients
The coefficients in multinomial logistic regression are estimated using maximum likelihood estimation (MLE). This process involves finding the parameter values that maximize the likelihood of observing our data given the model.
Likelihood Function
For a dataset with $n$ observations and $K$ classes, the likelihood function is:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \prod_{j=1}^{K} \pi_{ij}^{\,y_{ij}}$$

where:
- $L(\boldsymbol{\beta})$ is the likelihood function (the probability of observing the entire dataset given the model parameters $\boldsymbol{\beta}$)
- $\boldsymbol{\beta}$ represents all coefficient parameters in the model (the collection of all $\beta_{jp}$ values for all non-reference categories and all predictors)
- $n$ is the number of observations (the sample size)
- $K$ is the number of categories (the total number of possible outcome classes)
- $\pi_{ij}$ is the predicted probability that observation $i$ belongs to category $j$ (computed using the softmax function)
- $y_{ij}$ is the indicator that equals 1 if observation $i$ truly belongs to class $j$, and 0 otherwise (it selects the probability of the observed class)
Log-Likelihood Function
Working with the log-likelihood is computationally more convenient:

$$\ell(\boldsymbol{\beta}) = \ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \sum_{j=1}^{K} y_{ij} \ln(\pi_{ij})$$

where:
- $\ell(\boldsymbol{\beta})$ is the log-likelihood function (the natural logarithm of the likelihood function, which transforms products into sums for computational convenience and numerical stability)

Substituting the softmax probability formula:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \sum_{j=1}^{K} y_{ij} \left( \eta_{ij} - \ln \sum_{k=1}^{K} e^{\eta_{ik}} \right)$$
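Assuming one-hot labels and a matrix of fitted probabilities, the log-likelihood can be evaluated with a few lines of NumPy; the numbers below are made up solely to exercise the function:

```python
import numpy as np

def log_likelihood(Y, Pi):
    """Multinomial log-likelihood: sum over i, j of y_ij * log(pi_ij).

    Y  : (n, K) one-hot indicator matrix
    Pi : (n, K) predicted probability matrix
    """
    return np.sum(Y * np.log(Pi))

# Tiny illustrative example (values are made up)
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
Pi = np.array([[0.7, 0.2, 0.1],
               [0.3, 0.6, 0.1]])
print(log_likelihood(Y, Pi))   # log(0.7) + log(0.6) ≈ -0.867
```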
Gradient and Hessian
To find the maximum likelihood estimates, we need the gradient (first derivatives) and Hessian (second derivatives) of the log-likelihood function.
Let's break down how to compute the gradient and Hessian of the multinomial logistic regression log-likelihood step by step.
Step 1: Recall the log-likelihood
The log-likelihood for multinomial logistic regression is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \sum_{j=1}^{K} y_{ij} \ln(\pi_{ij})$$

where $y_{ij}$ is 1 if observation $i$ belongs to class $j$, and 0 otherwise.
Step 2: Substitute the softmax probability
Recall that

$$\pi_{ij} = \frac{e^{\eta_{ij}}}{\sum_{k=1}^{K} e^{\eta_{ik}}}, \qquad \eta_{ij} = \mathbf{x}_i^{\top} \boldsymbol{\beta}_j$$

where:
- $\eta_{ij}$ is the linear predictor for observation $i$ and category $j$ (computed as the dot product of the feature vector and coefficient vector)
- $\mathbf{x}_i^{\top}$ is the transpose of the feature vector for observation $i$ (a row vector including a 1 for the intercept)
- $\boldsymbol{\beta}_j$ is the coefficient vector for category $j$ (a column vector including the intercept and slope coefficients)
Step 3: Compute the gradient (first derivative) with respect to $\beta_{jp}$
We want to find the derivative of the log-likelihood with respect to a particular coefficient $\beta_{jp}$ (the coefficient for predictor $p$ in class $j$).
The log-likelihood for a single observation $i$ is:

$$\ell_i = \sum_{k=1}^{K} y_{ik} \ln(\pi_{ik})$$

The derivative with respect to $\beta_{jp}$ is:

$$\frac{\partial \ell_i}{\partial \beta_{jp}} = \sum_{k=1}^{K} y_{ik} \, \frac{\partial \ln(\pi_{ik})}{\partial \beta_{jp}}$$

Since $\pi_{ik}$ depends on all of the linear predictors $\eta_{i1}, \dots, \eta_{iK}$, but depends on $\beta_{jp}$ only through $\eta_{ij}$, we use the chain rule.
The derivative of $\ln(\pi_{ik})$ with respect to $\beta_{jp}$ is:

$$\frac{\partial \ln(\pi_{ik})}{\partial \beta_{jp}} = \big( \mathbb{1}[k = j] - \pi_{ij} \big) \, x_{ip}$$

(This comes from differentiating the log-softmax.)
Summing over all observations, and using the fact that $\sum_{k} y_{ik} = 1$, we get:

$$\frac{\partial \ell}{\partial \beta_{jp}} = \sum_{i=1}^{n} \big( y_{ij} - \pi_{ij} \big) \, x_{ip}$$
Step 4: Compute the Hessian (second derivative) with respect to $\beta_{jp}$ and $\beta_{kq}$
- The Hessian entry for parameters $\beta_{jp}$ and $\beta_{kq}$ is the second derivative $\dfrac{\partial^2 \ell}{\partial \beta_{jp} \, \partial \beta_{kq}}$.
- Differentiating the gradient again, we get:

$$\frac{\partial^2 \ell}{\partial \beta_{jp} \, \partial \beta_{kq}} = -\sum_{i=1}^{n} x_{ip} \, x_{iq} \, \pi_{ij} \big( \delta_{jk} - \pi_{ik} \big)$$

where $\delta_{jk}$ is 1 if $j = k$, and 0 otherwise.
Results:
Gradient (First Derivatives):

$$\frac{\partial \ell}{\partial \beta_{jp}} = \sum_{i=1}^{n} x_{ip} \big( y_{ij} - \pi_{ij} \big)$$

where:
- $\frac{\partial \ell}{\partial \beta_{jp}}$ is the partial derivative of the log-likelihood with respect to coefficient $\beta_{jp}$ (the gradient component for this parameter)
- $\beta_{jp}$ is the coefficient for predictor $p$ in category $j$ (the parameter we're taking the derivative with respect to)
- $x_{ip}$ is the value of predictor $p$ for observation $i$ (the feature value that multiplies this coefficient)
- $y_{ij}$ is the indicator that equals 1 if observation $i$ truly belongs to category $j$, and 0 otherwise
- $\pi_{ij}$ is the predicted probability for observation $i$ in category $j$ (computed using the current parameter values)

Hessian (Second Derivatives):

$$\frac{\partial^2 \ell}{\partial \beta_{jp} \, \partial \beta_{kq}} = -\sum_{i=1}^{n} x_{ip} \, x_{iq} \, \pi_{ij} \big( \delta_{jk} - \pi_{ik} \big)$$

where:
- $\frac{\partial^2 \ell}{\partial \beta_{jp} \, \partial \beta_{kq}}$ is the second partial derivative of the log-likelihood (measures the curvature of the log-likelihood surface with respect to parameters $\beta_{jp}$ and $\beta_{kq}$)
- $\beta_{kq}$ is the coefficient for predictor $q$ in category $k$ (the second parameter in the mixed partial derivative)
- $x_{iq}$ is the value of predictor $q$ for observation $i$ (the feature value that multiplies coefficient $\beta_{kq}$)
- $\delta_{jk}$ is the Kronecker delta (equals 1 if categories $j$ and $k$ are the same, and 0 otherwise)
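As a sanity check on these formulas, here is a small NumPy sketch that evaluates the gradient for all non-reference coefficients at once and a single $(j, k)$ block of the Hessian. The array shapes, and the convention that the last column of the label and probability matrices is the reference class, are assumptions made for this sketch.

```python
import numpy as np

def gradient(X, Y, Pi):
    """Gradient of the log-likelihood w.r.t. the non-reference coefficients:
    d l / d beta_jp = sum_i x_ip (y_ij - pi_ij).

    X  : (n, P+1) design matrix (intercept column included)
    Y  : (n, K) one-hot labels; the last column is the reference class
    Pi : (n, K) fitted probabilities
    Returns an array of shape (P+1, K-1).
    """
    return X.T @ (Y[:, :-1] - Pi[:, :-1])

def hessian_block(X, Pi, j, k):
    """Hessian block for non-reference classes j and k:
    d^2 l / d beta_jp d beta_kq = -sum_i x_ip x_iq pi_ij (delta_jk - pi_ik).
    Returns a (P+1, P+1) matrix.
    """
    delta = 1.0 if j == k else 0.0
    w = Pi[:, j] * (delta - Pi[:, k])      # per-observation weights
    return -(X * w[:, None]).T @ X
```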
Optimization Methods
The coefficients are found by maximizing the log-likelihood function using numerical optimization methods:
- Newton-Raphson Method: Uses both gradient and Hessian information for faster convergence
- Fisher Scoring: Uses the expected Fisher information matrix instead of the observed Hessian
- Gradient Descent: Iteratively updates parameters in the direction of the gradient
- Limited-memory BFGS (L-BFGS): A quasi-Newton method that approximates the Hessian
Iterative Reweighted Least Squares (IRLS)
Multinomial logistic regression can also be solved using IRLS, which reformulates the problem as a series of weighted least squares problems. This approach, sketched in simplified form after the list below, proceeds as follows:
- Starts with initial coefficient estimates
- Computes fitted probabilities using the current coefficients
- Creates working responses and weights
- Solves a weighted least squares problem
- Updates the coefficients
- Repeats until convergence
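The weighted least squares step makes a faithful IRLS implementation somewhat involved, so as a simplified stand-in the sketch below fits the non-reference coefficients with plain gradient ascent on the log-likelihood, using the gradient derived earlier and a gradient-norm stopping rule. The learning rate and tolerance are illustrative, and features may need scaling for this to converge quickly.

```python
import numpy as np

def fit_multinomial_gd(X, Y, lr=0.01, tol=1e-6, max_iter=10_000):
    """Fit non-reference coefficients by gradient ascent on the log-likelihood.

    X : (n, P+1) design matrix with an intercept column
    Y : (n, K) one-hot labels; the last column is treated as the reference class
    Returns B with shape (P+1, K-1).
    """
    n, K = Y.shape
    B = np.zeros((X.shape[1], K - 1))                  # start from all-zero coefficients
    for _ in range(max_iter):
        H = np.column_stack([X @ B, np.zeros(n)])      # linear predictors, eta_ref = 0
        Z = np.exp(H - H.max(axis=1, keepdims=True))
        Pi = Z / Z.sum(axis=1, keepdims=True)          # fitted probabilities
        grad = X.T @ (Y[:, :-1] - Pi[:, :-1])          # gradient from the previous section
        B += lr * grad                                 # ascend the log-likelihood
        if np.linalg.norm(grad) < tol:                 # gradient convergence criterion
            break
    return B
```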
Convergence Criteria
The optimization process continues until one of the following convergence criteria is met:
- Parameter convergence: $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$
- Log-likelihood convergence: $|\ell^{(t+1)} - \ell^{(t)}| < \epsilon$
- Gradient convergence: $\|\nabla \ell^{(t)}\| < \epsilon$

where:
- $\boldsymbol{\beta}^{(t)}$ is the vector of coefficient estimates at iteration $t$ (the current parameter values during optimization)
- $\boldsymbol{\beta}^{(t+1)}$ is the vector of coefficient estimates at iteration $t+1$ (the updated parameter values after one optimization step)
- $\ell^{(t)}$ is the log-likelihood value at iteration $t$ (the objective function value at the current parameters)
- $\ell^{(t+1)}$ is the log-likelihood value at iteration $t+1$ (the objective function value at the updated parameters)
- $\nabla \ell^{(t)}$ is the gradient vector at iteration $t$ (the vector of partial derivatives of the log-likelihood with respect to all parameters)
- $\epsilon$ is the tolerance threshold (a small positive number, typically on the order of $10^{-6}$ or $10^{-8}$, that determines when convergence is achieved)
- $|\cdot|$ denotes the absolute value (for scalars) and $\|\cdot\|$ the norm (for vectors), measuring the magnitude of change
Identifiability and Constraints
The multinomial logistic regression model is said to be overparameterized because, without any constraints, there are more parameters in the model than are actually needed to describe the relationship between the predictors and the outcome categories. Specifically, for $K$ outcome categories, the model estimates a separate set of coefficients for each category, but only $K-1$ of these sets are necessary to uniquely determine the predicted probabilities. This redundancy means that different sets of coefficients can produce the same predicted probabilities, making the model unidentifiable unless we impose a constraint.
To resolve this, the standard approach is to set the coefficients for one of the categories, called the reference category, to zero:

$$\boldsymbol{\beta}_K = \mathbf{0}$$

where:
- $\boldsymbol{\beta}_K$ is the coefficient vector for the reference category (a vector set to all zeros to ensure model identifiability)
- $\mathbf{0}$ is the zero vector (a vector with all elements equal to zero)
By doing this, we remove the redundancy and ensure that the model is identifiable, meaning the optimization problem has a unique solution and the estimated coefficients are interpretable relative to the reference category.
Mathematical Properties
The multinomial logistic regression model has several important mathematical properties. First, the softmax function ensures that all predicted probabilities are between 0 and 1 and sum to 1, which is necessary for a valid probability model. Second, without a constraint the model is only identified up to an additive shift: adding the same constant vector to every category's coefficient vector leaves all predicted probabilities unchanged, which is why we typically set the coefficients for the reference category to zero.
The model also has the property that the ratio of probabilities for any two categories depends only on the difference in their linear predictors, $\pi_{ij} / \pi_{ik} = e^{\eta_{ij} - \eta_{ik}}$. In particular, this ratio does not depend on which other categories are available, which is another way of stating the independence of irrelevant alternatives assumption discussed earlier.
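Both properties are easy to verify numerically: shifting every linear predictor by the same constant leaves the softmax probabilities, and hence their ratios, unchanged. A tiny check with arbitrary $\eta$ values:

```python
import numpy as np

def softmax(eta):
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

eta = np.array([1.5, 0.3, 0.0])     # illustrative linear predictors
shifted = eta + 2.7                 # add the same constant to every class

print(softmax(eta))
print(softmax(shifted))             # identical probabilities
```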
This visualization demonstrates how coefficients in multinomial logistic regression are interpreted relative to a reference category. Each colored line shows how the probability of a class changes as a predictor increases. Class 2 (reference) has zero coefficients, while Classes 0 and 1 have their own coefficient sets. The slopes reflect the magnitude and direction of each class's coefficients.
The coefficient magnitudes shown here represent the effect of the predictor on the log-odds of each class relative to the reference. Positive coefficients increase the probability of that class as the predictor increases, while negative coefficients decrease it. The reference class (Class 2) serves as the baseline with coefficients fixed at zero.
Visualizing Multinomial Logistic Regression
Let's create visualizations to understand how multinomial logistic regression works with different numbers of categories and how the decision boundaries are formed.
Decision boundaries in multinomial logistic regression showing how the model partitions the feature space into regions corresponding to each class. Each colored region represents where the model predicts that class with highest probability.
Class probability surfaces demonstrating how multinomial logistic regression assigns probabilities to each class across the feature space. The overlapping contours show how probabilities transition smoothly between classes, with all probabilities summing to one at every point.
Class probabilities as a function of a single predictor variable, showing the characteristic S-shaped curves for each class. The probabilities sum to one by construction, demonstrating the softmax function's normalization property.
Log-odds relationships showing the linear dependence on predictor variables. The log-odds of each class relative to the reference class (Class 2) are linear functions of the predictor, which is the fundamental assumption of multinomial logistic regression.
Example
Let's work through a concrete example to understand how multinomial logistic regression works with actual numbers. We'll use a simple dataset with two predictor variables and three classes to make the calculations manageable.
Dataset Setup
Suppose we have a dataset with two features (age and income) and three possible outcomes (low, medium, high risk). We'll use "low" as our reference category and calculate the probabilities for "medium" and "high" risk.
| Age | Income | Risk Level |
|---|---|---|
| 25 | 30 | low |
| 30 | 40 | low |
| 35 | 50 | low |
| 40 | 60 | medium |
| 45 | 70 | medium |
| 50 | 80 | medium |
| 25 | 35 | medium |
| 30 | 45 | medium |
| 35 | 55 | high |
| 40 | 65 | high |
| 45 | 75 | high |
| 50 | 85 | high |
Coefficient Estimation
As mentioned in the formula section earlier, the coefficients in multinomial logistic regression are found by solving an optimization problem that maximizes the likelihood of the observed data. In practice, this is done using numerical algorithms such as Newton-Raphson or gradient-based methods, which iteratively adjust the coefficients to best fit the data.
For this example, let's assume we have already obtained the estimated coefficients. We'll now use these given coefficients to demonstrate how to calculate the predicted probabilities for a new observation.
Manual Calculation
Now let's manually calculate the predicted probabilities for an observation with age = 25 and income = 30 using these estimated coefficients:
Coefficients for medium vs low (reference):
- $\beta_{0}^{(\text{med})}$ (intercept)
- $\beta_{\text{age}}^{(\text{med})}$ (age coefficient)
- $\beta_{\text{income}}^{(\text{med})}$ (income coefficient)
Coefficients for high vs low (reference):
- $\beta_{0}^{(\text{high})}$ (intercept)
- $\beta_{\text{age}}^{(\text{high})}$ (age coefficient)
- $\beta_{\text{income}}^{(\text{high})}$ (income coefficient)
Step-by-Step Calculation
Step 1: Calculate Linear Predictors
For our observation with age = 25 and income = 30, we calculate the linear predictor for each category:

$$\eta_{\text{medium}} = \beta_{0}^{(\text{med})} + \beta_{\text{age}}^{(\text{med})} \cdot \text{age} + \beta_{\text{income}}^{(\text{med})} \cdot \text{income}$$

where:
- $\eta_{\text{medium}}$ is the linear predictor for the medium risk category (the log-odds of medium risk relative to low risk before normalization)
- $\beta_{0}^{(\text{med})}$ is the intercept for medium risk (the baseline log-odds when age and income are both zero)
- $\beta_{\text{age}}^{(\text{med})}$ is the age coefficient for medium risk (the change in log-odds for each one-year increase in age)
- $\beta_{\text{income}}^{(\text{med})}$ is the income coefficient for medium risk (the change in log-odds for each one-unit increase in income)

Substituting the feature values:

$$\eta_{\text{medium}} = \beta_{0}^{(\text{med})} + \beta_{\text{age}}^{(\text{med})} \cdot 25 + \beta_{\text{income}}^{(\text{med})} \cdot 30$$

Similarly for high risk:

$$\eta_{\text{high}} = \beta_{0}^{(\text{high})} + \beta_{\text{age}}^{(\text{high})} \cdot \text{age} + \beta_{\text{income}}^{(\text{high})} \cdot \text{income}$$

where:
- $\eta_{\text{high}}$ is the linear predictor for the high risk category (the log-odds of high risk relative to low risk before normalization)
- $\beta_{0}^{(\text{high})}$ is the intercept for high risk (the baseline log-odds when age and income are both zero)
- $\beta_{\text{age}}^{(\text{high})}$ is the age coefficient for high risk (the change in log-odds for each one-year increase in age)
- $\beta_{\text{income}}^{(\text{high})}$ is the income coefficient for high risk (the change in log-odds for each one-unit increase in income)

Substituting the feature values:

$$\eta_{\text{high}} = \beta_{0}^{(\text{high})} + \beta_{\text{age}}^{(\text{high})} \cdot 25 + \beta_{\text{income}}^{(\text{high})} \cdot 30$$

For the reference category (low risk):

$$\eta_{\text{low}} = 0$$

where:
- $\eta_{\text{low}}$ is the linear predictor for the low risk category (set to 0 because low risk is the reference category)
Step 2: Calculate Exponentials
We apply the exponential function to each linear predictor:

$$e^{\eta_{\text{medium}}}, \qquad e^{\eta_{\text{high}}}, \qquad e^{\eta_{\text{low}}} = e^{0} = 1$$

where:
- $e$ is Euler's number (approximately 2.71828)
- $e^{\eta_{\text{medium}}}$ is the exponential of the linear predictor for medium risk (used in the softmax numerator)
- $e^{\eta_{\text{high}}}$ is the exponential of the linear predictor for high risk (used in the softmax numerator)
- $e^{\eta_{\text{low}}}$ is the exponential of the linear predictor for low risk (equals 1 because $\eta_{\text{low}} = 0$)
Step 3: Calculate Denominator
The denominator of the softmax function is the sum of all exponentials:

$$S = \sum_{k=1}^{K} e^{\eta_k} = e^{\eta_{\text{low}}} + e^{\eta_{\text{medium}}} + e^{\eta_{\text{high}}}$$

where:
- $S$ is the sum of exponentials across all categories (the normalization constant in the softmax function)
- $K = 3$ is the total number of categories (low, medium, high)

Substituting the values:

$$S = 1 + e^{\eta_{\text{medium}}} + e^{\eta_{\text{high}}}$$
Step 4: Calculate Probabilities
Using the softmax formula, we calculate the probability for each category:

$$P(\text{low}) = \frac{e^{\eta_{\text{low}}}}{S} = \frac{1}{S} \approx 0.055$$

where:
- $P(\text{low})$ is the predicted probability of low risk for this observation (computed by dividing the exponential of the low risk linear predictor by the sum of all exponentials)

$$P(\text{medium}) = \frac{e^{\eta_{\text{medium}}}}{S} \approx 0.245$$

where:
- $P(\text{medium})$ is the predicted probability of medium risk for this observation (computed by dividing the exponential of the medium risk linear predictor by the sum of all exponentials)

$$P(\text{high}) = \frac{e^{\eta_{\text{high}}}}{S} \approx 0.700$$

where:
- $P(\text{high})$ is the predicted probability of high risk for this observation (computed by dividing the exponential of the high risk linear predictor by the sum of all exponentials)
Step 5: Verification
We verify that all probabilities sum to 1, as required by the softmax function:

$$0.055 + 0.245 + 0.700 = 1.000$$
Interpretation
The calculations show that for a 25-year-old with income 30, the model predicts:
- 5.5% probability of low risk
- 24.5% probability of medium risk
- 70.0% probability of high risk
The predicted class would be "high" since it has the highest probability. Notice how the probabilities sum to exactly 1, which is a fundamental property of the multinomial logistic regression model.
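To mirror these steps in code without restating the estimated coefficients, the helper below computes the three linear predictors, exponentiates them, and normalizes with the softmax. The coefficient values passed in are hypothetical placeholders for illustration only; substituting the fitted values would reproduce the probabilities above.

```python
import numpy as np

def class_probabilities(x, coef_medium, coef_high):
    """Softmax probabilities for (low, medium, high) risk, with 'low' as the
    reference class. Each coef vector is (intercept, b_age, b_income)."""
    features = np.array([1.0, *x])                  # prepend 1 for the intercept
    eta = np.array([
        0.0,                                        # low risk: reference, eta = 0
        features @ np.asarray(coef_medium),         # medium vs low
        features @ np.asarray(coef_high),           # high vs low
    ])
    z = np.exp(eta - eta.max())
    return z / z.sum()

# Hypothetical coefficient values, purely for illustration.
probs = class_probabilities(x=[25, 30],
                            coef_medium=[0.2, 0.03, 0.01],
                            coef_high=[-0.5, 0.04, 0.02])
print(dict(zip(["low", "medium", "high"], probs.round(3))))
```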
Implementation in Scikit-learn
Scikit-learn provides an efficient implementation of multinomial logistic regression through the LogisticRegression class. In this section, we'll walk through a complete example using a customer risk classification dataset, demonstrating how to prepare data, train the model, and interpret results.
Data Preparation
First, we'll create our dataset and prepare it for modeling. We'll use age and income as predictors to classify customers into three risk categories:
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    log_loss,
)

# Create sample dataset
data = {
    "age": [25, 30, 35, 40, 45, 50, 25, 30, 35, 40, 45, 50],
    "income": [30, 40, 50, 60, 70, 80, 35, 45, 55, 65, 75, 85],
    "risk": [
        "low", "low", "low",
        "medium", "medium", "medium", "medium", "medium",
        "high", "high", "high", "high",
    ],
}

df = pd.DataFrame(data)

# Prepare features and target
X = df[["age", "income"]].values
y = df["risk"].values

# Encode categorical target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split into training and test sets (stratified to ensure all classes in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)
```

The LabelEncoder converts the categorical risk levels into numerical values (0, 1, 2), which is required for scikit-learn's implementation. We use stratified splitting to ensure all three risk categories appear in both training and test sets, which is important for reliable evaluation with small datasets.
Model Training
Now let's train the multinomial logistic regression model using the L-BFGS solver, which is well-suited for small to medium-sized datasets:
```python
# Initialize and train the model
model = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Calculate performance metrics
test_accuracy = accuracy_score(y_test, y_pred)
train_accuracy = model.score(X_train, y_train)
logloss = log_loss(y_test, y_pred_proba)
```

Model Coefficients
Let's examine the learned coefficients for each class to understand how age and income influence risk classification:
Class Encoding:
0: high
1: low
2: medium
Model Intercepts:
Class 0 (high): 3.8853
Class 1 (low): -1.6769
Class 2 (medium): -2.2084
Model Coefficients:
Class 0 (high):
Age: -0.9481
Income: 0.5568
Class 1 (low):
Age: 0.7673
Income: -0.4956
Class 2 (medium):
Age: 0.1808
Income: -0.0612
The coefficients reveal how each feature affects each class's score (log-odds). Note that scikit-learn's multinomial implementation estimates a full set of coefficients for every class using a symmetric softmax parametrization (identified through regularization), rather than fixing one reference category's coefficients at zero as in the textbook formulation; the difference between two classes' coefficients gives the effect on the log-odds of one class relative to the other. Positive coefficients increase the probability of that class as the feature value increases, while negative coefficients decrease it. For instance, the positive age coefficient for the "low" risk class means that, holding income fixed, older customers are pushed toward the low-risk class. The intercepts represent the baseline scores when all features equal zero.
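If you want to print the parameters alongside the original class labels rather than their encoded values, a snippet along these lines works, reusing the model and le objects defined earlier and assuming the feature order age, income:

```python
# Inspect the learned parameters, mapping encoded classes back to their labels
feature_names = ["age", "income"]

for idx, cls in enumerate(model.classes_):
    label = le.inverse_transform([cls])[0]
    print(f"Class {cls} ({label}): intercept = {model.intercept_[idx]:.4f}")
    for name, coef in zip(feature_names, model.coef_[idx]):
        print(f"    {name}: {coef:.4f}")
```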
Model Performance
Let's evaluate the model's performance on the test set:
Model Performance:
Training Accuracy: 1.000
Test Accuracy: 0.500
Log Loss: 0.591
The test accuracy indicates how often the model correctly predicts the risk category. Here the model fits the training data perfectly but reaches only 50% test accuracy, a gap that points to overfitting, which is unsurprising given the handful of observations available for training. The log loss measures the quality of the probability predictions; lower values indicate better calibrated probabilities, with values closer to 0 being preferable. With a dataset this small, all of these estimates should be treated as illustrative rather than as a reliable assessment of model quality.
Predictions and Probabilities
Let's examine individual predictions to understand how the model assigns probabilities:
Sample Predictions on Test Set:
--------------------------------------------------------------------------------
Observation 1:
Features: Age=45, Income=75
True Class: high
Predicted Class: high
Probabilities:
high: 0.838
low: 0.001
medium: 0.161
Observation 2:
Features: Age=35, Income=50
True Class: low
Predicted Class: medium
Probabilities:
high: 0.051
low: 0.324
medium: 0.625
Observation 3:
Features: Age=40, Income=60
True Class: medium
Predicted Class: medium
Probabilities:
high: 0.109
low: 0.100
medium: 0.791
Observation 4:
Features: Age=30, Income=45
True Class: medium
Predicted Class: high
Probabilities:
high: 0.456
low: 0.106
medium: 0.438
Each prediction shows the complete probability distribution across all three risk categories. The model assigns the observation to the class with the highest probability, but the full distribution provides valuable information about prediction confidence. For example, if one class has a probability of 0.9, the model is very confident, whereas probabilities of 0.4, 0.35, and 0.25 indicate high uncertainty. Notice how the probabilities sum to 1.0 by construction, which is a fundamental property of the softmax function used in multinomial logistic regression.
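Scoring a new, previously unseen customer works the same way; the age and income below are arbitrary illustrative values:

```python
# Score a new customer (age 38, income 52); probabilities come back in the
# order of model.classes_, which we map back to the original labels
new_customer = np.array([[38, 52]])
proba = model.predict_proba(new_customer)[0]

for cls, p in zip(le.inverse_transform(model.classes_), proba):
    print(f"{cls}: {p:.3f}")

predicted = le.inverse_transform(model.predict(new_customer))[0]
print("Predicted class:", predicted)
```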
Classification Report
The classification report provides detailed performance metrics for each class:
Classification Report:
precision recall f1-score support
high 0.50 1.00 0.67 1
low 0.00 0.00 0.00 1
medium 0.50 0.50 0.50 2
accuracy 0.50 4
macro avg 0.33 0.50 0.39 4
weighted avg 0.38 0.50 0.42 4
Confusion Matrix:
[[1 0 0]
[0 0 1]
[1 0 1]]
Rows represent true classes, columns represent predicted classes
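For reference, the report and matrix above can be generated with the metrics functions imported earlier, mapping the encoded classes back to their labels for readability:

```python
# Generate the per-class report and the confusion matrix shown above
class_names = le.inverse_transform(model.classes_)

print(classification_report(y_test, y_pred, target_names=class_names, zero_division=0))
print(confusion_matrix(y_test, y_pred))
```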