A comprehensive guide to CART (Classification and Regression Trees), including mathematical foundations, Gini impurity, variance reduction, and practical implementation with scikit-learn. Learn how to build interpretable decision trees for both classification and regression tasks.

CART (Classification and Regression Trees)
CART, or Classification and Regression Trees, represents one of the most intuitive and interpretable machine learning algorithms. At its core, CART builds decision trees by recursively partitioning the feature space into regions that best separate the target variable. Think of it as creating a flowchart where each decision point asks a simple yes/no question about a feature, ultimately leading to a prediction.
CART's simplicity and transparency make it particularly valuable in domains where interpretability is crucial, such as healthcare, finance, and regulatory compliance. Unlike black-box algorithms, every decision made by a CART model can be traced back to specific feature thresholds and logical rules. The algorithm works by finding the optimal splits that minimize impurity (for classification) or variance (for regression) at each node.
CART differs from other tree-based methods in several key ways. Unlike ensemble methods like Random Forest or Gradient Boosting that we will cover later, CART builds a single tree without any randomization or boosting mechanisms. It also uses a greedy approach, making the locally optimal split at each node without considering the global impact on the final tree structure. This can sometimes lead to suboptimal trees, but it makes the algorithm computationally efficient and easy to understand.
CART uses a greedy algorithm, meaning it makes the best decision at each step without considering future consequences. At each node, it chooses the split that provides the maximum immediate reduction in impurity, even if this might not lead to the globally optimal tree structure. This is similar to a chess player who always makes the best move for the current position without planning several moves ahead. While this approach is computationally efficient, it can sometimes result in trees that are not globally optimal.
The algorithm's recursive nature means it continues splitting until it reaches a stopping criterion, such as maximum depth, minimum samples per leaf, or when no further improvement can be made. This process creates a hierarchical structure where each path from root to leaf represents a unique decision rule that can be easily translated into human-readable if-then statements.
Advantages
CART models offer exceptional interpretability that sets them apart from many other algorithms. Unlike neural networks or ensemble methods, every prediction can be traced through a clear sequence of if-then rules. This makes CART particularly valuable in regulated industries like healthcare and finance, where model decisions must be explainable to auditors, regulators, and end users. A doctor can follow the tree path to understand why a patient was classified as high-risk, making CART well-suited for clinical decision support systems.
The algorithm handles real-world data problems gracefully, demonstrating remarkable robustness to data quality issues. It can work with datasets containing missing values, outliers, and mixed data types without extensive preprocessing. The algorithm's use of surrogate splits means that even when primary features are missing, alternative decision paths can still provide reasonable predictions. This robustness makes CART particularly valuable for exploratory data analysis and rapid prototyping scenarios where data quality may be uncertain.
CART naturally captures complex interactions between features without requiring manual feature engineering, effectively performing automatic feature engineering during the tree-building process. The algorithm can discover non-obvious relationships, such as "if age > 65 AND income < 30k AND has_diabetes = true, then high_risk = true." These discovered patterns often provide valuable business insights beyond just predictions, making CART useful for both modeling and data exploration purposes.
Training and prediction with CART are computationally efficient, providing significant performance advantages. The greedy splitting approach, while sometimes suboptimal, ensures that training time scales reasonably with data size. This efficiency makes CART well-suited for embedded systems, mobile applications, and scenarios requiring low-latency predictions where computational resources are constrained.
Disadvantages
CART models exhibit high variance, meaning small changes in training data can produce substantially different trees. This instability stems from the algorithm's sensitivity to data variations and its tendency to overfit. A single outlier or mislabeled sample can significantly alter the tree structure, which can be problematic for applications where consistency is important. This high variance also means that cross-validation results can vary between runs, making it challenging to obtain stable performance estimates.
The algorithm struggles with simple linear relationships that other algorithms handle effortlessly. While a linear regression can capture a relationship like "price = 2 × size + 100" with a single equation, CART might need dozens of splits to approximate the same relationship. This inefficiency makes CART suboptimal for problems where linear relationships dominate, such as many economic and physical modeling scenarios where the underlying relationships are inherently linear.
The algorithm exhibits a strong bias toward features with high cardinality or many unique values. Categorical features with many categories (like ZIP codes or product IDs) and continuous features with many distinct values get unfairly favored during split selection. This bias can lead to trees that perform well on training data but generalize poorly, as they may be overfitting to spurious patterns in high-cardinality features rather than learning meaningful relationships.
While CART can capture interactions, it's constrained by its axis-parallel splits, limiting its expressiveness for complex geometric patterns. It cannot efficiently represent certain geometric patterns, such as diagonal decision boundaries or circular regions, that might be natural for the underlying data. This limitation becomes particularly problematic in high-dimensional spaces where the curse of dimensionality makes axis-parallel splits increasingly ineffective at capturing the true data structure.
CART can only create axis-parallel splits: each split is a threshold on a single feature, so every decision boundary is parallel to a feature axis. This means it cannot create diagonal or curved decision boundaries. For example, if the optimal decision boundary is a diagonal line (say, a boundary of the form $x_2 > x_1$), CART will need many small rectangular regions to approximate it, leading to a complex tree structure. This is why other algorithms like Support Vector Machines or Neural Networks might be more appropriate for problems with complex geometric patterns.
Standard decision trees, including CART, cannot extrapolate to unseen ranges of input features. Unlike linear models that can extend predictions beyond the training data range, decision trees are constrained to make predictions only within the feature ranges observed during training.
For example, if a tree was trained on house prices for properties between 1,000-3,000 square feet, it cannot make reliable predictions for a 5,000 square foot house. The tree will simply use the prediction from the leaf node corresponding to the highest training value (3,000+ square feet), which may be completely inappropriate for the new, larger property. This limitation is particularly problematic in time series forecasting, growth modeling, and any scenario where you need to predict beyond the observed data range.
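The following is a minimal sketch of this behavior using synthetic square-footage data (all numbers are illustrative): a regression tree fit on properties between 1,000 and 3,000 square feet returns essentially the same prediction for 3,000, 4,000, and 5,000 square feet.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: square footage between 1,000 and 3,000 with a linear price trend
rng = np.random.default_rng(42)
sqft = rng.uniform(1000, 3000, size=200).reshape(-1, 1)
price = 100 * sqft.ravel() + rng.normal(0, 5000, size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(sqft, price)

# Predictions beyond the training range are flat: the tree reuses the
# prediction of the leaf that covers the largest training values
print(tree.predict([[3000], [4000], [5000]]))  # roughly the same value three times
```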
Formula - Classification
The mathematical foundation of CART revolves around finding the optimal split that minimizes impurity for classification or variance for regression. Let's walk through these concepts step by step.
Gini Impurity
For classification problems, CART uses the Gini impurity measure to evaluate the quality of a split. The Gini impurity measures how often a randomly chosen element would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
We start with the basic Gini impurity formula for a node:

$$G(S) = 1 - \sum_{k=1}^{K} p_k^2$$

where:
- $G(S)$ is the Gini impurity for node $S$ (a measure of impurity ranging from 0 for pure nodes to a maximum value when classes are evenly distributed)
- $S$ represents the set of samples at the current node (the subset of training data that reaches this node)
- $K$ is the number of classes (total number of distinct class labels in the classification problem)
- $p_k$ is the proportion of samples belonging to class $k$ in the set $S$ (calculated as $p_k = n_k / n$, where $n_k$ is the number of samples of class $k$ and $n$ is the total number of samples in node $S$)
- $k$ is the class index (ranging from 1 to $K$)
The reason we square the probabilities is to penalize nodes where the class distribution is mixed. When all samples belong to the same class (complete purity), $p_k = 1$ for one class and $p_k = 0$ for all others, making $G(S) = 0$. When classes are evenly distributed, the Gini impurity reaches its maximum of $1 - 1/K$.
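As a quick sanity check of the formula, here is a minimal NumPy sketch (the helper name `gini_impurity` is just for illustration):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

print(gini_impurity(["A", "A", "A", "A"]))                        # 0.0   -> pure node
print(gini_impurity(["A", "A", "B", "B"]))                        # 0.5   -> maximally mixed (two classes)
print(gini_impurity(["A", "A", "A", "A", "B", "B", "C", "C"]))    # 0.625 -> three unevenly mixed classes
```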
To find the best split, we calculate the weighted average of the Gini impurities of the resulting child nodes:

$$G_{\text{split}} = \frac{n_L}{n} G(S_L) + \frac{n_R}{n} G(S_R)$$

where:
- $G_{\text{split}}$ is the weighted Gini impurity after splitting node $S$ (the combined impurity of both child nodes, weighted by their sizes)
- $S_L$ is the left child node after the split (the subset of samples that satisfy the split condition)
- $S_R$ is the right child node after the split (the subset of samples that do not satisfy the split condition)
- $n$ is the number of samples in the parent node (total samples before splitting)
- $n_L$ is the number of samples in the left child node (samples satisfying the split condition)
- $n_R$ is the number of samples in the right child node (samples not satisfying the split condition, where $n_L + n_R = n$)
- $\frac{n_L}{n}$ and $\frac{n_R}{n}$ are the weights (proportions of samples going to each child node, ensuring larger child nodes have more influence on the overall impurity measure)
The algorithm selects the split that minimizes this weighted Gini impurity. This makes intuitive sense: we want splits that create child nodes with as pure class distributions as possible.
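Extending the previous sketch, a candidate split can be scored by weighting the child impurities by their sizes (the helper is repeated here so the snippet stands alone; the class labels below match the worked example later in this article):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def weighted_gini(left_labels, right_labels):
    """Weighted Gini impurity of a candidate split (children weighted by size)."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left_labels) + (n_right / n) * gini_impurity(right_labels)

# A split that isolates class A on one side beats a split that leaves both children mixed
print(weighted_gini(["A", "A", "A", "A"], ["B", "B", "C", "C"]))   # 0.25
print(weighted_gini(["A", "C", "A"], ["A", "B", "B", "A", "C"]))   # ~0.567
```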
Regression: Mean Squared Error
For regression problems, CART minimizes the variance (or equivalently, the mean squared error) of the target variable within each leaf node. The variance for a node is calculated as:

$$\text{Var}(S) = \frac{1}{n} \sum_{i \in S} (y_i - \bar{y})^2$$

where:
- $\text{Var}(S)$ is the variance of the target variable in node $S$ (a measure of how spread out the target values are around their mean)
- $n$ is the number of samples in node $S$ (total count of data points in this node)
- $y_i$ is the target value for sample $i$ (the actual observed response value for the $i$-th sample in node $S$)
- $\bar{y}$ is the mean target value in the node (the average of all target values in node $S$)
- $\sum_{i \in S}$ indicates summation over all samples in node $S$
The mean target value is calculated as:

$$\bar{y} = \frac{1}{n} \sum_{i \in S} y_i$$

where:
- $\bar{y}$ is the mean target value (the arithmetic average of all target values in node $S$, which becomes the prediction for this node)
- $\frac{1}{n}$ is the normalization factor (ensures we compute the average rather than the sum)
- $\sum_{i \in S} y_i$ is the sum of all target values in node $S$
When comparing candidate splits, it is often convenient to work with the sum of squared errors (SSE), which is the same quantity as the variance but without the division by $n$:

$$\text{SSE}(S) = \sum_{i \in S} (y_i - \bar{y})^2$$

where:
- $\text{SSE}(S)$ is the sum of squared deviations from the mean for node $S$ (used as an alternative to the variance)
- $(y_i - \bar{y})^2$ is the squared deviation of sample $i$ from the mean (measures how far each sample's target value is from the node's average)
For a split, we calculate the total sum of squared errors of the child nodes:

$$\text{SSE}_{\text{split}} = \text{SSE}(S_L) + \text{SSE}(S_R)$$

where:
- $\text{SSE}_{\text{split}}$ is the total sum of squared errors after splitting node $S$ (the combined error of both child nodes)
- $\text{SSE}(S_L)$ is the sum of squared errors of the left child node (calculated as $\sum_{i \in S_L} (y_i - \bar{y}_L)^2$)
- $\text{SSE}(S_R)$ is the sum of squared errors of the right child node (calculated as $\sum_{i \in S_R} (y_i - \bar{y}_R)^2$)
Because each child's contribution already scales with its number of samples, minimizing this total is equivalent to minimizing the weighted average of the child variances. This mirrors the weighted Gini impurity used in classification, applied to continuous target values.
Split Selection Process
The algorithm evaluates all possible splits for each feature and selects the one that provides the maximum reduction in impurity or variance. For a continuous feature with values $x_1, x_2, \ldots, x_n$, we consider splits of the form $x \leq t$ for all possible thresholds $t$. For categorical features, we consider all possible subsets of categories.
The information gain (or variance reduction) for a split is:

$$\Delta I = I(\text{parent}) - I_{\text{split}}$$

where:
- $\Delta I$ is the information gain (the reduction in impurity achieved by the split, with higher values indicating better splits)
- $I(\text{parent})$ is the impurity of the parent node before splitting (either Gini impurity for classification or variance for regression)
- $I_{\text{split}}$ is the weighted impurity after splitting (the combined impurity of both child nodes)
For classification, this becomes:

$$\Delta G = G(S) - G_{\text{split}}$$

where:
- $G(S)$ is the Gini impurity of the parent node before splitting
- $G_{\text{split}}$ is the weighted Gini impurity after splitting (calculated as $\frac{n_L}{n} G(S_L) + \frac{n_R}{n} G(S_R)$)
For regression, this becomes:

$$\Delta \text{Var} = \text{Var}(S) - \left( \frac{n_L}{n} \text{Var}(S_L) + \frac{n_R}{n} \text{Var}(S_R) \right)$$

where:
- $\text{Var}(S)$ is the variance of the parent node before splitting
- $\frac{n_L}{n} \text{Var}(S_L)$ is the weighted variance contribution of the left child node
- $\frac{n_R}{n} \text{Var}(S_R)$ is the weighted variance contribution of the right child node
The algorithm selects the split with the maximum gain, ensuring that each split provides the most significant improvement in model performance.
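Below is a minimal, unoptimized sketch of this search for classification (function names are illustrative; production implementations sort each feature once and evaluate midpoints between consecutive values rather than the raw values themselves). The small dataset is the one used in the worked example later in this article.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y):
    """Exhaustive greedy search: try every unique value of every feature as a threshold
    and keep the split with the largest reduction in Gini impurity."""
    n_samples, n_features = X.shape
    parent = gini(y)
    best = {"gain": 0.0, "feature": None, "threshold": None}
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            weighted = (len(left) / n_samples) * gini(left) + (len(right) / n_samples) * gini(right)
            gain = parent - weighted
            if gain > best["gain"]:
                best = {"gain": round(gain, 4), "feature": j, "threshold": float(t)}
    return best

# Small dataset from the worked example later in this article
X = np.array([[2.5, 1.2], [3.1, 2.8], [1.8, 3.2], [4.2, 1.5],
              [2.9, 2.1], [1.5, 2.9], [3.8, 2.3], [2.2, 1.8]])
y = np.array(["A", "B", "A", "C", "B", "A", "C", "A"])
print(best_split(X, y))   # {'gain': 0.375, 'feature': 0, 'threshold': 2.5}
```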
Mathematical Properties
The weighted Gini impurity and variance never increase when a node is split: the best available split at worst leaves the impurity unchanged, and any informative split strictly decreases it. This property ensures that each split the algorithm chooses improves or preserves the quality of the fit on the training data.
The impurity functions are concave in the class proportions, which means the weighted average of the child nodes' impurities is always less than or equal to the parent node's impurity. This ensures that splitting does not increase the overall (weighted) impurity.
For each feature, the algorithm must evaluate all possible split points. For continuous features with $v$ unique values, this requires $O(v)$ evaluations. For categorical features with $q$ categories, it requires evaluating $2^{q-1} - 1$ possible splits, making it computationally expensive for high-cardinality categorical features.
where:
- $v$ is the number of unique values for a continuous feature (the distinct values that could serve as split thresholds)
- $O(v)$ represents linear computational complexity (evaluating each unique value as a potential split point)
- $q$ is the number of categories for a categorical feature (the distinct category labels)
- $2^{q-1} - 1$ is the number of possible binary partitions of $q$ categories (exponential growth with the number of categories)
The computational complexity of CART grows significantly with the number of features and their cardinality. For a dataset with $m$ features and $n$ samples, the worst-case time complexity is $O(m \cdot n \log n)$ for continuous features, but can be much higher for categorical features with many categories.
where:
- $m$ is the number of features (predictor variables in the dataset)
- $n$ is the number of samples (data points in the training set)
- $O(m \cdot n \log n)$ represents the computational complexity (the algorithm must evaluate $m$ features, and for each feature, sorting its $n$ values takes $O(n \log n)$ time)
This is why preprocessing steps like feature selection or categorical encoding are often necessary for large-scale applications.
Formula - Regression
The mathematical foundation of CART for regression problems focuses on minimizing variance (or equivalently, mean squared error) at each split. Unlike classification, where we measure impurity, regression trees aim to create leaf nodes with the lowest possible variance in the target variable.
Mean Squared Error
For regression problems, CART uses the mean squared error (MSE) to evaluate the quality of a split. The MSE measures the average squared difference between the predicted values and the actual target values in a node.
We start with the basic MSE formula for a node:

$$\text{MSE}(S) = \frac{1}{n} \sum_{i \in S} (y_i - \bar{y})^2$$

where:
- $\text{MSE}(S)$ is the mean squared error for node $S$ (the average squared deviation of target values from their mean)
- $n$ is the number of samples in node $S$ (total count of data points in this node)
- $y_i$ is the target value for sample $i$ (the actual observed response value for the $i$-th sample in node $S$)
- $\bar{y}$ is the mean target value in the node (the average of all target values in node $S$)
- $\sum_{i \in S}$ indicates summation over all samples in node $S$
The mean target value is calculated as:

$$\bar{y} = \frac{1}{n} \sum_{i \in S} y_i$$

where:
- $\bar{y}$ is the mean target value (the arithmetic average of all target values in node $S$, which becomes the prediction for this node)
- $\frac{1}{n}$ is the normalization factor (ensures we compute the average rather than the sum)
- $\sum_{i \in S} y_i$ is the sum of all target values in node $S$
The reason we use squared differences is to penalize larger errors more heavily than smaller ones. This ensures that the algorithm prioritizes splits that create more homogeneous groups in terms of their target values. When all samples in a node have the same target value (complete homogeneity), the MSE equals zero.
To find the best split, we calculate the weighted sum of the MSEs of the resulting child nodes:

$$\text{MSE}_{\text{split}} = \frac{n_L}{n} \text{MSE}(S_L) + \frac{n_R}{n} \text{MSE}(S_R)$$

where:
- $\text{MSE}_{\text{split}}$ is the weighted mean squared error after splitting node $S$ (the combined MSE of both child nodes, weighted by their sizes)
- $S_L$ is the left child node after the split (the subset of samples that satisfy the split condition)
- $S_R$ is the right child node after the split (the subset of samples that do not satisfy the split condition)
- $n$ is the number of samples in the parent node (total samples before splitting)
- $n_L$ is the number of samples in the left child node (samples satisfying the split condition)
- $n_R$ is the number of samples in the right child node (samples not satisfying the split condition, where $n_L + n_R = n$)
- $\frac{n_L}{n}$ and $\frac{n_R}{n}$ are the weights (proportions of samples going to each child node, ensuring larger child nodes have more influence on the overall MSE measure)
The algorithm selects the split that minimizes this weighted MSE. This makes intuitive sense: we want splits that create child nodes with as low variance as possible in their target values.
Variance Reduction
An alternative and equivalent approach is to use variance instead of MSE. The variance for a node is calculated as:

$$\text{Var}(S) = \frac{1}{n} \sum_{i \in S} (y_i - \bar{y})^2$$

where:
- $\text{Var}(S)$ is the variance of the target variable in node $S$ (identical to the MSE, measuring the spread of the target values)
- All other symbols have the same meaning as in the MSE formula above
For a split, we calculate the weighted sum of the variances of the child nodes:

$$\text{Var}_{\text{split}} = \frac{n_L}{n} \text{Var}(S_L) + \frac{n_R}{n} \text{Var}(S_R)$$

where:
- $\text{Var}_{\text{split}}$ is the weighted variance after splitting node $S$ (the combined variance of both child nodes, weighted by their sizes)
- $\text{Var}(S_L)$ is the variance of the left child node (calculated as $\frac{1}{n_L} \sum_{i \in S_L} (y_i - \bar{y}_L)^2$)
- $\text{Var}(S_R)$ is the variance of the right child node (calculated as $\frac{1}{n_R} \sum_{i \in S_R} (y_i - \bar{y}_R)^2$)
- All other symbols have the same meaning as defined previously
The variance reduction for a split is:

$$\Delta \text{Var} = \text{Var}(S) - \text{Var}_{\text{split}}$$

where:
- $\Delta \text{Var}$ is the reduction in variance achieved by the split (the decrease in variance from parent to weighted child nodes, with higher values indicating better splits)
- $\text{Var}(S)$ is the variance of the parent node before splitting
- $\text{Var}_{\text{split}}$ is the weighted variance after splitting
The algorithm selects the split that provides the maximum variance reduction, which is equivalent to minimizing the weighted MSE.
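The following is a small sketch, with illustrative names, of how the variance reduction of one candidate split could be computed with NumPy (`np.var` uses the population variance, matching the $1/n$ formula above):

```python
import numpy as np

def variance_reduction(y_parent, mask):
    """Variance reduction of a candidate split; `mask` selects the left child."""
    y_left, y_right = y_parent[mask], y_parent[~mask]
    n, n_l, n_r = len(y_parent), len(y_left), len(y_right)
    weighted = (n_l / n) * np.var(y_left) + (n_r / n) * np.var(y_right)
    return np.var(y_parent) - weighted

# Toy example: splitting at x <= 3 separates low targets from high targets
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
print(variance_reduction(y, x <= 3))  # large reduction: both children are nearly homogeneous
```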
Split Selection Process
The algorithm evaluates all possible splits for each feature and selects the one that provides the maximum variance reduction. For a continuous feature with values $x_1, x_2, \ldots, x_n$, we consider splits of the form $x \leq t$ for all possible thresholds $t$. For categorical features, we consider all possible subsets of categories.
The information gain (or variance reduction) for a split is:

$$\Delta \text{Var} = \text{Var}(S) - \frac{n_L}{n} \text{Var}(S_L) - \frac{n_R}{n} \text{Var}(S_R)$$

where:
- $\Delta \text{Var}$ is the variance reduction (the decrease in variance achieved by the split, with higher values indicating better splits)
- $\text{Var}(S)$ is the variance of the parent node before splitting
- $\frac{n_L}{n} \text{Var}(S_L)$ is the weighted variance contribution of the left child node
- $\frac{n_R}{n} \text{Var}(S_R)$ is the weighted variance contribution of the right child node
- The right-hand side can be rewritten as $\text{Var}(S) - \text{Var}_{\text{split}}$, showing this is equivalent to the variance reduction formula above
The algorithm selects the split with the maximum gain, ensuring that each split provides the most significant reduction in variance.
Mathematical Properties
The weighted MSE and variance never increase when a node is split: any informative split strictly decreases them, and the worst case leaves them unchanged. This property ensures that each chosen split improves or preserves the quality of the fit on the training data.
The within-node variance obeys a decomposition analogous to the law of total variance: the weighted average of the child nodes' variances is always less than or equal to the parent node's variance. This ensures that splitting does not increase the overall (weighted) variance.
For each feature, the algorithm must evaluate all possible split points. For continuous features with $v$ unique values, this requires $O(v)$ evaluations. For categorical features with $q$ categories, it requires evaluating $2^{q-1} - 1$ possible splits, making it computationally expensive for high-cardinality categorical features.
where:
- $v$ is the number of unique values for a continuous feature (the distinct values that could serve as split thresholds)
- $O(v)$ represents linear computational complexity (evaluating each unique value as a potential split point)
- $q$ is the number of categories for a categorical feature (the distinct category labels)
- $2^{q-1} - 1$ is the number of possible binary partitions of $q$ categories (exponential growth with the number of categories)
Visualizing CART
Let's create visualizations that demonstrate how CART builds decision trees and makes predictions. We'll show both the tree structure and the decision boundaries it creates.
The tree visualization shows the hierarchical decision-making process, with each node displaying the splitting criterion and the resulting class distribution.
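The article's original interactive figures are not reproduced here; the sketch below shows one plausible way to draw the rectangular decision regions with scikit-learn and matplotlib (dataset and parameters are chosen purely for illustration).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# 2D toy data so the axis-parallel decision boundaries are easy to see
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Evaluate the tree on a grid and draw the rectangular decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap="viridis")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", edgecolor="k", s=20)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Axis-parallel decision regions of a depth-3 CART tree")
plt.show()
```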
Example
Let's work through a concrete example to understand how CART builds a decision tree step by step. We'll use a simple classification problem with two features and three classes.
Dataset Setup
Consider the following training data:
| Sample | Feature 1 (X₁) | Feature 2 (X₂) | Class |
|---|---|---|---|
| 1 | 2.5 | 1.2 | A |
| 2 | 3.1 | 2.8 | B |
| 3 | 1.8 | 3.2 | A |
| 4 | 4.2 | 1.5 | C |
| 5 | 2.9 | 2.1 | B |
| 6 | 1.5 | 2.9 | A |
| 7 | 3.8 | 2.3 | C |
| 8 | 2.2 | 1.8 | A |
Step 1: Calculate Initial Gini Impurity
First, we calculate the Gini impurity for the root node (all samples):
- Class A: 4 samples (50%)
- Class B: 2 samples (25%)
- Class C: 2 samples (25%)

$$G_{\text{root}} = 1 - (0.5^2 + 0.25^2 + 0.25^2) = 1 - 0.375 = 0.625$$
Step 2: Evaluate All Possible Splits
Split on X₁ ≤ 2.5:
Left child (X₁ ≤ 2.5): Samples 1, 3, 6, 8
- Class A: 4 samples (100%)
- Class B: 0 samples (0%)
- Class C: 0 samples (0%)
Right child (X₁ > 2.5): Samples 2, 4, 5, 7
- Class A: 0 samples (0%)
- Class B: 2 samples (50%)
- Class C: 2 samples (50%)
Weighted Gini impurity:

$$G_{\text{left}} = 0 \;(\text{pure node}), \qquad G_{\text{right}} = 1 - (0.5^2 + 0.5^2) = 0.5$$
$$G_{\text{split}} = \frac{4}{8} \cdot 0 + \frac{4}{8} \cdot 0.5 = 0.25$$

Information gain:

$$\Delta G = 0.625 - 0.25 = 0.375$$
Split on X₂ ≤ 2.0:
Left child (X₂ ≤ 2.0): Samples 1, 4, 8
- Class A: 2 samples (66.7%)
- Class B: 0 samples (0%)
- Class C: 1 sample (33.3%)
Right child (X₂ > 2.0): Samples 2, 3, 5, 6, 7
- Class A: 2 samples (40%)
- Class B: 2 samples (40%)
- Class C: 1 sample (20%)
Weighted Gini impurity:

$$G_{\text{left}} = 1 - \left(\left(\tfrac{2}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2\right) \approx 0.444, \qquad G_{\text{right}} = 1 - (0.4^2 + 0.4^2 + 0.2^2) = 0.64$$
$$G_{\text{split}} = \frac{3}{8} \cdot 0.444 + \frac{5}{8} \cdot 0.64 \approx 0.567$$

Information gain:

$$\Delta G = 0.625 - 0.567 \approx 0.058$$
Step 3: Select Best Split
The split on X₁ ≤ 2.5 provides the highest information gain (0.375), so we choose this as our first split.
Step 4: Continue Building the Tree
We now have two child nodes. The left child (X₁ ≤ 2.5) is already pure (all samples are class A), so it becomes a leaf node. The right child (X₁ > 2.5) still has mixed classes and needs further splitting.
For the right child, we repeat the process:
- Current samples: 2, 4, 5, 7
- Classes: B, C, B, C
The best split here is again on the first feature, X₁ ≤ 3.45 (any threshold between 3.1 and 3.8 works), which separates classes B and C perfectly; no single threshold on X₂ can cleanly separate these four samples.
Final Tree Structure
Root: X₁ ≤ 2.5?
├── Yes → Class A (Leaf)
└── No → X₁ ≤ 3.45?
    ├── Yes → Class B (Leaf)
    └── No → Class C (Leaf)
This example demonstrates how CART greedily selects splits that maximize information gain at each step, ultimately creating a tree that perfectly classifies the training data.
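As a cross-check, fitting scikit-learn's DecisionTreeClassifier on the same eight samples should recover an equivalent tree (scikit-learn places thresholds midway between adjacent observed values, so the printed numbers differ slightly from the ones above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[2.5, 1.2], [3.1, 2.8], [1.8, 3.2], [4.2, 1.5],
              [2.9, 2.1], [1.5, 2.9], [3.8, 2.3], [2.2, 1.8]])
y = np.array(["A", "B", "A", "C", "B", "A", "C", "A"])

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(clf, feature_names=["X1", "X2"]))
# Expect a root split on X1 (threshold around 2.70 instead of 2.50), followed by
# a second split on X1 (around 3.45) that separates classes B and C.
```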
Implementation in Scikit-learn
Scikit-learn provides a comprehensive implementation of CART through the DecisionTreeClassifier and DecisionTreeRegressor classes. Let's walk through how to build, train, and evaluate decision trees for both classification and regression tasks.
Data Preparation
We'll start by importing the necessary libraries and creating synthetic datasets for both classification and regression examples.
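The original notebook code is not reproduced here, but a setup along the following lines would match the description (dataset sizes, parameters, and variable names such as `Xc_train` are illustrative assumptions; later snippets in this section reuse these variables):

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

# Synthetic classification data: 1,000 samples, 10 features, 3 classes
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=3, random_state=42
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)

# Synthetic regression data: 1,000 samples, 10 features, continuous target
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, n_informative=5, noise=10.0, random_state=42
)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
```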
Classification Implementation
Now let's build a decision tree classifier. We'll configure several key parameters to control the tree's complexity and prevent overfitting.
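One plausible configuration, continuing from the variables defined in the data-preparation sketch above (the specific hyperparameter values are illustrative starting points, not tuned results):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",        # split quality measure
    max_depth=5,             # cap tree depth to limit overfitting
    min_samples_split=10,    # require enough samples before splitting a node
    min_samples_leaf=5,      # require enough samples in every leaf
    random_state=42,
)
clf.fit(Xc_train, yc_train)
print(f"Depth: {clf.get_depth()}, leaves: {clf.get_n_leaves()}")
```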
Let's examine the classification performance:
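Evaluation along these lines, again assuming the variables from the previous sketches:

```python
from sklearn.model_selection import cross_val_score

test_accuracy = clf.score(Xc_test, yc_test)
cv_scores = cross_val_score(clf, Xc_train, yc_train, cv=5, scoring="accuracy")

print(f"Test accuracy: {test_accuracy:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```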
The test accuracy shows how well the model performs on unseen data. The cross-validation scores provide a more robust estimate of model performance by averaging results across multiple train-test splits. The standard deviation indicates the stability of the model's performance—lower values suggest more consistent predictions across different data subsets.
Regression Implementation
For regression tasks, we use DecisionTreeRegressor with similar parameters but adjusted for the continuous target variable.
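A comparable regression setup, reusing the synthetic regression split from the data-preparation sketch (hyperparameters again illustrative):

```python
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(
    criterion="squared_error",   # minimize within-node variance
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
)
reg.fit(Xr_train, yr_train)
```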
Let's examine the regression performance:
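One way to compute the test and cross-validated RMSE for the fitted regressor:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rmse = np.sqrt(mean_squared_error(yr_test, reg.predict(Xr_test)))
cv_rmse = np.sqrt(-cross_val_score(reg, Xr_train, yr_train, cv=5,
                                   scoring="neg_mean_squared_error"))

print(f"Test RMSE: {rmse:.2f}")
print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")
```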
The RMSE (Root Mean Squared Error) measures the average prediction error in the same units as the target variable. Lower RMSE values indicate better predictive accuracy. The cross-validation RMSE provides a more reliable estimate of how the model will perform on new data, with the standard deviation showing the consistency of predictions across different data splits.
Feature Importance Analysis
Decision trees automatically calculate feature importance based on how much each feature reduces impurity across all splits.
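A short sketch of inspecting the fitted classifier's importances (values depend on the synthetic data assumed above):

```python
import numpy as np

importances = clf.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```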
These importance scores indicate which features contribute most to the model's predictions. Features with higher scores have a greater influence on the decision-making process. In this example, the scores show how the tree prioritizes different features when making splits.
Tree Visualization
Visualizing the tree structure helps understand the decision-making process:
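A minimal plotting sketch using scikit-learn's plot_tree, assuming the classifier trained above (figure size and font size are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    clf,
    filled=True,  # color nodes by majority class; darker shades indicate purer nodes
    feature_names=[f"feature_{i}" for i in range(Xc_train.shape[1])],
    class_names=[str(c) for c in clf.classes_],
    fontsize=8,
    ax=ax,
)
plt.show()
```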
Note: This visualization is intended for illustrative purposes. It's normal if not all feature details are visible in the tree plot.
The tree visualization shows the complete decision-making hierarchy. Each node displays the splitting criterion (feature and threshold), the Gini impurity, the number of samples, and the class distribution. The color intensity indicates node purity—darker colors represent more homogeneous nodes. This visual representation makes it easy to trace any prediction back through the specific sequence of decisions that led to it.
Key Parameters
Below are the main parameters that control how CART decision trees work and perform.
- max_depth: Maximum depth of the tree (default: None). Limiting depth prevents overfitting by restricting tree complexity. Start with values like 5-10 and adjust based on cross-validation performance.
- min_samples_split: Minimum samples required to split an internal node (default: 2). Higher values prevent the tree from creating splits on small subsets, reducing overfitting. Values of 10-20 work well for most datasets.
- min_samples_leaf: Minimum samples required in a leaf node (default: 1). This ensures leaves have sufficient data for reliable predictions. Values of 5-10 help prevent overfitting while maintaining predictive power.
- max_features: Number of features to consider for the best split (default: None for all features). Options include 'sqrt' (square root of total features), 'log2' (log base 2 of total features), or an integer. Using fewer features can reduce overfitting and speed up training.
- criterion: Function to measure split quality. For classification: 'gini' (default) or 'entropy'. For regression: 'squared_error' (default) or 'absolute_error'. Gini and squared_error are computationally faster and work well in most cases.
- random_state: Seed for reproducibility (default: None). Set to an integer to ensure consistent results across runs, especially important when using max_features with randomization.
- min_impurity_decrease: Minimum impurity decrease required for a split (default: 0.0). A node will split only if the decrease in impurity is greater than or equal to this value. Useful for pruning weak splits.
- ccp_alpha: Complexity parameter for minimal cost-complexity pruning (default: 0.0). Higher values result in more aggressive pruning, creating simpler trees. Use cross-validation to find the optimal value.
Key Methods
The following are the most commonly used methods for interacting with CART decision tree models.
- fit(X, y): Trains the decision tree on the training data X and target values y. The tree is built by recursively finding the best splits according to the specified criterion.
- predict(X): Returns predicted class labels (classification) or values (regression) for input data X. For classification, returns the majority class in the leaf node. For regression, returns the mean value of samples in the leaf node.
- predict_proba(X): Returns probability estimates for each class (classification only). Probabilities are calculated as the fraction of training samples of each class in the leaf node. Useful for setting custom decision thresholds or understanding prediction confidence.
- score(X, y): Returns the mean accuracy (classification) or R² score (regression) on the given test data. Provides a quick way to evaluate model performance.
- get_depth(): Returns the depth of the decision tree. Useful for understanding tree complexity and diagnosing potential overfitting.
- get_n_leaves(): Returns the number of leaves in the decision tree. More leaves indicate a more complex model that may be overfitting.
- apply(X): Returns the index of the leaf that each sample ends up in. Useful for understanding which samples follow similar decision paths or for creating custom features based on tree structure.
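A brief usage sketch of a few of these methods, assuming the classifier and test split from the earlier snippets in this section:

```python
# Probability estimates reflect the class mix in the leaf each sample falls into
print(clf.predict_proba(Xc_test[:3]))

# apply() returns the leaf index each sample lands in, handy for grouping
# samples that follow the same decision path
print(clf.apply(Xc_test[:3]))

# Quick complexity diagnostics
print(clf.get_depth(), clf.get_n_leaves())
```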
Practical Applications
When to Use CART
CART models excel in scenarios where interpretability and transparency are paramount. In healthcare applications, decision trees help clinicians understand diagnostic reasoning, making it easier to validate recommendations and explain decisions to patients. The ability to translate tree structures into simple if-then rules makes CART valuable in regulatory environments where model decisions must be explainable to auditors and stakeholders. For example, a credit scoring model built with CART can clearly show why an application was rejected, which is often a legal requirement.
CART is particularly effective for exploratory data analysis and initial model development. The tree structure naturally reveals which features are most important for predictions and how they interact, providing insights that can guide feature engineering for more complex models. When working with small to medium-sized datasets (typically under 10,000 samples), CART can deliver competitive performance while maintaining complete transparency. The algorithm also handles mixed data types naturally, making it suitable for datasets containing both categorical and numerical features without extensive preprocessing.
However, CART is less appropriate when maximum predictive accuracy is the primary goal, as ensemble methods like Random Forest or Gradient Boosting typically outperform single trees. Avoid CART when you need to extrapolate beyond the training data range, since decision trees can only make predictions within the feature space they've observed. For problems with complex geometric patterns or diagonal decision boundaries, algorithms like Support Vector Machines or Neural Networks may be more effective due to CART's limitation to axis-parallel splits.
Best Practices
To achieve optimal results with CART, start by carefully tuning the tree depth using cross-validation. Begin with max_depth values between 3 and 10, as deeper trees tend to overfit while shallow trees may underfit. Set min_samples_split to values between 10 and 50 to prevent the tree from creating splits based on small subsets of data, which often capture noise rather than meaningful patterns. Similarly, configure min_samples_leaf to require at least 5-10 samples in each leaf node, ensuring that predictions are based on sufficient evidence.
Use cross-validation to evaluate model performance rather than relying solely on training accuracy. A tree that achieves very high training accuracy may be overfitting and will likely perform poorly on new data. Consider using cost-complexity pruning by setting the ccp_alpha parameter, which removes branches that provide minimal improvement in predictive power. Start with ccp_alpha=0.0 and gradually increase it while monitoring cross-validation scores to find the optimal balance between model complexity and generalization.
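One plausible way to choose ccp_alpha by cross-validation, reusing the training data assumed earlier (the grid comes from the tree's own cost-complexity pruning path):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas from the pruning path of a fully grown tree;
# cross-validation then picks the value that generalizes best
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(Xc_train, yc_train)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(Xc_train, yc_train)
print(search.best_params_, round(search.best_score_, 3))
```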
For classification tasks, the default Gini impurity criterion works well in most cases and is computationally faster than entropy. However, if you're specifically interested in maximizing information gain or working with highly imbalanced classes, entropy may provide better results. When dealing with categorical features that have many categories, consider grouping rare categories together or using target encoding to reduce cardinality, as high-cardinality features can dominate the splitting process and lead to overfitting.
Data Requirements and Preprocessing
CART handles various data types with minimal preprocessing requirements, but certain considerations can significantly improve performance. While the algorithm doesn't strictly require feature scaling due to its use of rank-based splits, standardization can be beneficial when features have vastly different ranges, particularly if you plan to compare feature importance values across features. For categorical features, CART naturally handles them through binary splits, but encoding strategies become important when categories number in the hundreds or thousands.
Missing values require careful attention: scikit-learn's tree estimators historically did not support them natively (recent releases add limited NaN handling for dense inputs), so you will typically either impute missing values or use surrogate splits (available in some other implementations, such as R's rpart). For imputation, consider using median values for numerical features and mode for categorical features, or more sophisticated methods like iterative imputation if missingness patterns are informative. When dealing with outliers, CART is relatively robust compared to distance-based algorithms, as extreme values only affect the specific splits where they appear rather than the entire model.
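A minimal imputation sketch, assuming the training variables from the earlier snippets; wrapping the imputer in a pipeline keeps its statistics confined to the training data during cross-validation:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Median imputation for numeric features, followed by the tree
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=5, random_state=42),
)

# Artificially introduce some missing values for demonstration
X_missing = Xc_train.copy()
X_missing[::20, 0] = np.nan
model.fit(X_missing, yc_train)
```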
The algorithm performs best when the target variable has sufficient representation across all classes (for classification) or a reasonable distribution (for regression). For highly imbalanced classification problems, consider using class weights or resampling techniques before training. For regression, if the target variable spans several orders of magnitude, log-transforming it may help the tree create more meaningful splits and improve prediction accuracy on the original scale.
Common Pitfalls
One frequent mistake is allowing trees to grow too deep without proper regularization, resulting in models that memorize training data rather than learning generalizable patterns. This manifests as very high training accuracy but poor test performance. To avoid this, set explicit constraints on tree depth and minimum samples per split, and validate performance using cross-validation rather than a single train-test split. Another common error is selecting hyperparameters based solely on training performance, which typically leads to overfitting.
Relying exclusively on a single decision tree for production applications can be problematic due to the algorithm's high variance. Small changes in training data can produce dramatically different tree structures, making predictions unstable over time. If you observe significant performance fluctuations when retraining on slightly different data, consider using ensemble methods or implementing model averaging strategies. Additionally, practitioners sometimes overlook the importance of feature engineering, assuming that CART will automatically discover all relevant patterns. While the algorithm does capture interactions, creating domain-specific features often substantially improves performance.
Ignoring the interpretability-complexity tradeoff is another pitfall. Some users add numerous regularization parameters and complex preprocessing pipelines that negate CART's primary advantage of interpretability. If your final model requires extensive documentation to explain, you may have lost the benefit of using a decision tree in the first place. Finally, failing to validate that the tree's learned rules align with domain knowledge can lead to models that are mathematically optimal but practically meaningless. Review the top splits and feature importance to ensure they make sense in the context of your problem.
Computational Considerations
CART's computational complexity for training is O(n × m × log n) where n is the number of samples and m is the number of features, making it efficient for small to medium datasets. For datasets under 10,000 samples, training typically completes in seconds on modern hardware. However, performance degrades for very large datasets (>100,000 samples) or high-dimensional feature spaces (>1,000 features), where the algorithm must evaluate many potential splits at each node. In such cases, consider using sampling strategies to train on representative subsets or limiting the number of features considered at each split using the max_features parameter.
Memory requirements are generally modest, as the tree structure itself is compact—each node only stores the split criterion, threshold, and pointers to child nodes. A typical tree with 100 leaf nodes requires only a few kilobytes of memory. However, the training process requires keeping the full dataset in memory, which can be limiting for very large datasets. For datasets that don't fit in memory, consider using out-of-core learning approaches or distributed computing frameworks, though this often negates the simplicity advantage of CART.
Prediction time scales linearly with tree depth, making CART suitable for real-time applications where low latency is critical. A typical tree with depth 10 requires only 10 comparisons to make a prediction, which completes in microseconds. This makes CART particularly attractive for embedded systems, mobile applications, or high-throughput prediction scenarios. However, be aware that extremely deep trees (depth >20) or very wide trees (many branches at each node) can slow predictions, especially when processing large batches of data.
Performance and Deployment Considerations
Evaluating CART performance requires using appropriate metrics for your task. For classification, examine not just accuracy but also precision, recall, and F1-scores for each class, particularly if classes are imbalanced. The confusion matrix reveals which classes the tree confuses, often highlighting areas where additional features or different split criteria might help. For regression, consider both RMSE (which penalizes large errors) and MAE (which treats all errors equally) to understand the error distribution. Always compare performance against a simple baseline, such as predicting the majority class or mean value, to ensure the tree is learning meaningful patterns.
When deploying CART models to production, monitor for data drift that could degrade performance over time. Since trees cannot adapt to new patterns without retraining, establish a schedule for model updates based on performance monitoring. Track key metrics like prediction accuracy, feature distributions, and the proportion of samples reaching each leaf node. Significant changes in these metrics indicate that retraining may be necessary. Consider implementing A/B testing when deploying new tree versions to validate that changes improve real-world performance.
The interpretability of CART makes it well-suited for applications requiring model explainability, but this requires proper documentation. Export the tree structure as a set of rules and validate that they align with domain expertise. In regulated industries, maintain audit trails showing how the tree was trained, which hyperparameters were selected, and how performance was validated. For critical applications, implement safeguards such as confidence thresholds—only making predictions when the leaf node contains sufficient samples or has high class purity. This conservative approach can prevent errors on edge cases where the tree's training data was sparse.
Summary
CART represents a fundamental approach to machine learning that prioritizes interpretability and simplicity. By recursively partitioning the feature space based on impurity measures, CART creates decision trees that can be easily understood and validated by domain experts. The algorithm's greedy nature, while sometimes leading to suboptimal solutions, makes it computationally efficient and transparent in its decision-making process.
The mathematical foundation of CART, based on Gini impurity for classification and variance reduction for regression, provides a principled approach to finding optimal splits. However, the algorithm's tendency toward overfitting and its sensitivity to small data changes require careful hyperparameter tuning and validation strategies. The built-in feature selection and natural handling of missing values make CART particularly suitable for exploratory data analysis and initial model development.
In practice, CART serves as an excellent starting point for many machine learning problems, especially when interpretability is crucial. While ensemble methods often provide superior predictive performance, CART's transparency and simplicity make it invaluable in domains where understanding the model's reasoning is as important as its accuracy. The algorithm's ability to handle mixed data types and create human-readable decision rules positions it as a versatile tool in the data scientist's toolkit, particularly valuable for stakeholder communication and regulatory compliance scenarios.