Causal Language Modeling: The Foundation of Generative AI

Michael Brenndoerfer

Data, Analytics & AI Language AI Handbook Machine Learning

Learn how causal language modeling trains AI to predict the next token. Covers autoregressive factorization, cross-entropy loss, causal masking, scaling laws, and perplexity evaluation.

Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Causal Language ModelingLink Copied

Language models learn to predict the next word. This simple objective, applied at massive scale, has produced the most capable AI systems ever built. GPT-4, Claude, LLaMA, and virtually every modern generative model share this foundation: given a sequence of tokens, predict what comes next.

Causal Language Modeling (CLM) is the formal name for this training objective. The "causal" refers to the direction of information flow: predictions depend only on past tokens, never on future ones. This constraint mirrors how humans produce language, word by word, and makes the learned model directly usable for text generation.

In this chapter, we'll unpack the mathematics behind CLM, understand why it works so well, implement the loss function from scratch, and explore the training data and scaling properties that have driven recent breakthroughs.

The Autoregressive FactorizationLink Copied

How do you assign a probability to an entire sentence? This is the fundamental question that language models must answer. Given a sequence of tokens $x_1, x_2, \ldots, x_n$ , we want to compute $P(x_1, x_2, \ldots, x_n)$ , the probability that this particular sequence occurs in natural language.

The Combinatorial ChallengeLink Copied

Consider what this means in practice. With a vocabulary of 50,000 tokens and sequences of length 100, we'd need to assign probabilities to $50000^{100}$ possible sequences. That's more possibilities than atoms in the observable universe, raised to a power larger than the universe's age in seconds. Storing or computing such a distribution directly is impossible.

We need a way to break this intractable joint probability into manageable pieces. Fortunately, probability theory gives us exactly such a tool: the chain rule.

The Chain Rule of ProbabilityLink Copied

The chain rule states that any joint probability can be decomposed into a product of conditional probabilities. For a sequence, this means:

P(x_1, x_2, \ldots, x_n) = P(x_1) \cdot P(x_2 | x_1) \cdot P(x_3 | x_1, x_2) \cdots P(x_n | x_1, \ldots, x_{n-1})

Let's trace through what each term represents:

$P(x_1)$ : The probability of the first token appearing at the start of a sequence. This has no conditioning context.
$P(x_2 | x_1)$ : Given we've seen the first token, what's the probability of the second?
$P(x_3 | x_1, x_2)$ : Given the first two tokens, what comes third?
And so on, until $P(x_n | x_1, \ldots, x_{n-1})$ : the final token, conditioned on everything before it.

This telescoping product captures the sequential nature of language. Each new word depends on what came before, exactly matching our intuition about how text is generated.

The Compact NotationLink Copied

We can express this factorization more concisely using product notation:

P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t | x_{<t})

where:

$x_t$ : the token at position $t$ in the sequence
$x_{<t}$ : all tokens before position $t$ , that is, the sequence $(x_1, x_2, \ldots, x_{t-1})$
$P(x_t | x_{<t})$ : the conditional probability of token $x_t$ given all preceding tokens

For the first position where $t=1$ , we define $x_{<1}$ as empty, so $P(x_1 | x_{<1}) = P(x_1)$ .

Each factor $P(x_t | x_{<t})$ is a conditional distribution over the entire vocabulary. Given everything we've seen so far, what's the probability of each possible next token? This is precisely the question a language model learns to answer.

Autoregressive Modeling

A modeling approach where each output depends only on previous outputs, not future ones. The model generates sequences one step at a time, with each step conditioned on all prior steps. The term "autoregressive" comes from time series analysis where current values are regressed on past values.

Why This Factorization WorksLink Copied

This decomposition is mathematically exact, not an approximation. The chain rule holds for any joint distribution. What makes it practical is that we've transformed an impossible problem (representing $50000^{100}$ probabilities) into a tractable one (learning a function that outputs a distribution over 50,000 tokens given any context).

The key insight is that while contexts vary enormously, they share statistical patterns. The word "the" tends to be followed by nouns. Questions end with question marks. Technical documents use technical vocabulary. A neural network can learn these patterns and generalize them to new contexts it has never seen before.

The CLM ObjectiveLink Copied

Now that we understand how to decompose sequence probability, we need a way to train a model to produce good probability estimates. This requires two things: a loss function that measures prediction quality, and a mechanism to improve the model based on that measurement.

From Maximum Likelihood to Minimum LossLink Copied

Our goal is to find model parameters $\theta$ that make the training data as probable as possible. Given a training sequence $x_1, \ldots, x_n$ , we want to maximize:

P_\theta(x_1, \ldots, x_n) = \prod_{t=1}^{n} P_\theta(x_t | x_{<t})

Maximizing a product of many small probabilities is numerically unstable. As sequences grow longer, the product shrinks toward zero, causing underflow. The standard solution is to work with logarithms. Since $\log$ is monotonically increasing, maximizing a probability is equivalent to maximizing its log:

\log P_\theta(x_1, \ldots, x_n) = \sum_{t=1}^{n} \log P_\theta(x_t | x_{<t})

The product becomes a sum, which is numerically stable and computationally convenient. Now, optimization algorithms typically minimize rather than maximize, so we flip the sign to get our loss function:

\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log P_\theta(x_t | x_{<t})

where:

$\mathcal{L}(\theta)$ : the loss function we minimize during training
$\theta$ : the model parameters (weights and biases of the neural network)
$n$ : the length of the training sequence
$P_\theta(x_t | x_{<t})$ : the probability the model assigns to the correct token $x_t$ given the context $x_{<t}$
$\log P_\theta(x_t | x_{<t})$ : the log-probability, which is negative since probabilities lie in $(0, 1)$

The Connection to Cross-EntropyLink Copied

This loss function has a beautiful interpretation: it's the cross-entropy between the model's predicted distribution and the true distribution (which puts all probability mass on the actual next token).

To see why, recall that cross-entropy measures how well a predicted distribution $Q$ matches a true distribution $P$ :

H(P, Q) = -\sum_x P(x) \log Q(x)

At each position $t$ , the "true distribution" is a one-hot vector: probability 1 for the actual token $x_t$ , and 0 for everything else. The model predicts a distribution $P_\theta(\cdot | x_{<t})$ over all vocabulary tokens. The cross-entropy simplifies to:

H = -1 \cdot \log P_\theta(x_t | x_{<t}) - \sum_{x \neq x_t} 0 \cdot \log P_\theta(x | x_{<t}) = -\log P_\theta(x_t | x_{<t})

Only the true token's probability matters. This is why cross-entropy loss is also called "negative log-likelihood" in language modeling contexts.

Understanding the Loss SignalLink Copied

The loss function creates an intuitive learning signal. When the model assigns high probability to the correct token, the loss contribution is small. When the model is surprised, the loss contribution is large.

Out[3]:

Visualization

Line plot showing negative log of probability decreasing from infinity near zero to zero near one, with steep descent for low probabilities. — The negative log-likelihood loss function. As the predicted probability for the correct token increases (x-axis), the loss decreases rapidly (y-axis). The curve is steep near zero, creating strong gradients when the model is confident but wrong. Near probability 1.0, the curve flattens, providing gentle gradients when the model is already performing well.

Consider these scenarios:

Confident and correct ( $P = 0.9$ ): $-\log(0.9) \approx 0.1$ . Small loss, weak gradient. The model is doing well here.
Uncertain ( $P = 0.1$ ): $-\log(0.1) \approx 2.3$ . Moderate loss. Room for improvement.
Confident and wrong ( $P = 0.01$ ): $-\log(0.01) \approx 4.6$ . Large loss, strong gradient. The model needs to update significantly.

This asymmetry is powerful: the model receives the strongest teaching signal precisely where it's making the biggest mistakes.

Efficient Learning from Every TokenLink Copied

A remarkable property of CLM loss is that it decomposes across positions. A single sequence of length $n$ provides $n$ independent gradient signals, one for each prediction. This is dramatically more efficient than classification tasks where a single input yields a single label.

Consider training on a document with 1,000 tokens. Each forward pass produces 1,000 predictions and 1,000 loss terms. The model learns from every token simultaneously, extracting maximum information from the training data. This efficiency is one reason language models can learn so much from their training corpora.

ImplementationLink Copied

Let's implement this loss function to see how it works in practice:

In[4]:

Code

import torch.nn.functional as F


def compute_clm_loss(logits, targets, ignore_index=-100):
    """
    Compute causal language modeling loss.

    Args:
        logits: Model outputs of shape (batch, seq_len, vocab_size)
        targets: Target token IDs of shape (batch, seq_len)
        ignore_index: Token ID to ignore in loss computation (e.g., padding)

    Returns:
        Scalar loss value (mean over all valid positions)
    """
    # Reshape for cross-entropy: (batch * seq_len, vocab_size)
    batch_size, seq_len, vocab_size = logits.shape
    logits_flat = logits.view(-1, vocab_size)
    targets_flat = targets.view(-1)

    # Cross-entropy computes -log(softmax(logits)[target])
    loss = F.cross_entropy(
        logits_flat, targets_flat, ignore_index=ignore_index, reduction="mean"
    )
    return loss

import torch.nn.functional as F


def compute_clm_loss(logits, targets, ignore_index=-100):
    """
    Compute causal language modeling loss.

    Args:
        logits: Model outputs of shape (batch, seq_len, vocab_size)
        targets: Target token IDs of shape (batch, seq_len)
        ignore_index: Token ID to ignore in loss computation (e.g., padding)

    Returns:
        Scalar loss value (mean over all valid positions)
    """
    # Reshape for cross-entropy: (batch * seq_len, vocab_size)
    batch_size, seq_len, vocab_size = logits.shape
    logits_flat = logits.view(-1, vocab_size)
    targets_flat = targets.view(-1)

    # Cross-entropy computes -log(softmax(logits)[target])
    loss = F.cross_entropy(
        logits_flat, targets_flat, ignore_index=ignore_index, reduction="mean"
    )
    return loss

Out[5]:

Console

Example CLM loss: 4.7462
Perplexity: 115.15
Random baseline (vocab=100): loss=4.6052, ppl=100

With random logits, the loss is close to $\log(V)$ where $V$ is the vocabulary size, because the model assigns roughly uniform probability $1/V$ to all tokens. As training progresses, the model learns to concentrate probability mass on likely continuations, reducing the loss.

Out[6]:

Visualization

Bar chart showing flat uniform distribution before training. — Before training, the model assigns nearly uniform probability across all tokens.

Bar chart showing peaked distribution after training. — After training, probability concentrates on likely continuations given the context.

From Sequence to Training ExamplesLink Copied

A key insight of CLM is that a single sequence yields multiple training examples. For a sequence of length $n$ , we predict tokens at positions 2 through $n$ using contexts of increasing length.

Consider the sentence "The cat sat on the mat". We create training pairs:

Training pairs from the sentence "The cat sat on the mat", showing how each position predicts the next token.

Context	Target
[START]	The
[START] The	cat
[START] The cat	sat
[START] The cat sat	on
[START] The cat sat on	the
[START] The cat sat on the	mat

The model sees the same sequence but learns from every position simultaneously. This is implemented using a clever shifting trick:

In[7]:

Code

def prepare_clm_batch(token_ids):
    """
    Prepare input and target tensors for CLM training.

    Input tokens are [0:n-1], targets are [1:n].
    The model predicts each token given all previous tokens.
    """
    # Input: all tokens except the last
    input_ids = token_ids[:, :-1]

    # Target: all tokens except the first (shifted by 1)
    labels = token_ids[:, 1:]

    return input_ids, labels


# Example: tokenized sequence
tokens = torch.tensor(
    [[101, 2054, 3921, 2068, 1999, 2023, 102]]
)  # [CLS] ... [SEP]
inputs, targets = prepare_clm_batch(tokens)

def prepare_clm_batch(token_ids):
    """
    Prepare input and target tensors for CLM training.

    Input tokens are [0:n-1], targets are [1:n].
    The model predicts each token given all previous tokens.
    """
    # Input: all tokens except the last
    input_ids = token_ids[:, :-1]

    # Target: all tokens except the first (shifted by 1)
    labels = token_ids[:, 1:]

    return input_ids, labels


# Example: tokenized sequence
tokens = torch.tensor(
    [[101, 2054, 3921, 2068, 1999, 2023, 102]]
)  # [CLS] ... [SEP]
inputs, targets = prepare_clm_batch(tokens)

Out[8]:

Console

Original sequence: [101, 2054, 3921, 2068, 1999, 2023, 102]
Input tokens:      [101, 2054, 3921, 2068, 1999, 2023]
Target tokens:     [2054, 3921, 2068, 1999, 2023, 102]

Training pairs:
  Context: [101] → Target: 2054
  Context: [101, 2054] → Target: 3921
  Context: [101, 2054, 3921] → Target: 2068
  Context: [101, 2054, 3921, 2068] → Target: 1999
  Context: [101, 2054, 3921, 2068, 1999] → Target: 2023
  Context: [101, 2054, 3921, 2068, 1999, 2023] → Target: 102

The inputs and targets are offset by one position. At position $i$ , the model receives tokens $0, 1, \ldots, i$ as input and predicts token $i+1$ . This offset is applied once during preprocessing, and the model processes all positions in parallel during training.

Causal MaskingLink Copied

For the autoregressive factorization to hold, the model must not "peek" at future tokens when predicting the current one. In transformer architectures, this is enforced through causal masking in the attention mechanism.

The attention pattern looks like this:

Out[9]:

Visualization

Heatmap showing a lower-triangular causal attention mask where each position can only attend to itself and earlier positions. — Causal attention mask for a sequence of 6 tokens. White cells indicate allowed attention (value 0), dark cells indicate blocked attention (value negative infinity). Position 3 can attend to positions 0, 1, 2, and 3, but not to positions 4 or 5. This lower-triangular pattern ensures information flows only from past to present.

In code, the causal mask is applied to attention scores before the softmax:

In[10]:

Code

def create_causal_mask(seq_len):
    """
    Create a causal attention mask.

    Returns a matrix where position (i, j) is:
    - 0 if i >= j (can attend)
    - -inf if i < j (cannot attend)
    """
    # Create upper triangular matrix of ones (above diagonal)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

    # Replace 1s with -inf to zero out attention after softmax
    mask = mask.masked_fill(mask == 1, float("-inf"))

    return mask


# Example usage in attention
def masked_attention(query, key, value, mask):
    """Simplified scaled dot-product attention with causal masking."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k**0.5)

    # Apply causal mask
    scores = scores + mask

    # Softmax: -inf becomes 0 probability
    attn_weights = F.softmax(scores, dim=-1)

    return torch.matmul(attn_weights, value), attn_weights

def create_causal_mask(seq_len):
    """
    Create a causal attention mask.

    Returns a matrix where position (i, j) is:
    - 0 if i >= j (can attend)
    - -inf if i < j (cannot attend)
    """
    # Create upper triangular matrix of ones (above diagonal)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

    # Replace 1s with -inf to zero out attention after softmax
    mask = mask.masked_fill(mask == 1, float("-inf"))

    return mask


# Example usage in attention
def masked_attention(query, key, value, mask):
    """Simplified scaled dot-product attention with causal masking."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k**0.5)

    # Apply causal mask
    scores = scores + mask

    # Softmax: -inf becomes 0 probability
    attn_weights = F.softmax(scores, dim=-1)

    return torch.matmul(attn_weights, value), attn_weights

Out[11]:

Console

Causal mask (4 positions):
[[  0. -inf -inf -inf]
 [  0.   0. -inf -inf]
 [  0.   0.   0. -inf]
 [  0.   0.   0.   0.]]

Attention weights after masking and softmax:
[[1.    0.    0.    0.   ]
 [0.5   0.5   0.    0.   ]
 [0.333 0.333 0.333 0.   ]
 [0.25  0.25  0.25  0.25 ]]

The negative infinity values become zero probability after softmax, effectively blocking information flow from future positions. Each row sums to 1.0, distributing attention only over valid (past and present) positions.

A Working Example: Training a Tiny CLMLink Copied

Let's train a minimal causal language model to see the complete pipeline. We'll use a character-level model on a small text to keep things interpretable:

In[12]:

Code

import torch

# Sample text for training
text = """To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune"""

# Character-level tokenization
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)

# Encode the text
encoded = torch.tensor([char_to_idx[c] for c in text])

import torch

# Sample text for training
text = """To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune"""

# Character-level tokenization
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)

# Encode the text
encoded = torch.tensor([char_to_idx[c] for c in text])

Out[13]:

Console

Vocabulary size: 22 characters
Characters: 
 TWabdefghilmnoqrstuw
Text length: 124 characters
Encoded (first 50): [2, 15, 1, 5, 7, 1, 15, 17, 1, 14, 15, 19, 1, 19, 15, 1, 5, 7, 1, 19, 10, 4, 19, 1, 11, 18, 1, 19, 10, 7, 1, 16, 20, 7, 18, 19, 11, 15, 14, 0, 3, 10, 7, 19, 10, 7, 17, 1, 19, 11]

Our tiny corpus contains only 27 unique characters (letters, spaces, and newlines). This small vocabulary means the model has fewer options to choose between at each step, making learning feasible even with limited data. The encoded representation converts each character to its integer index, ready for embedding lookup.

Now let's define a simple transformer-based language model:

In[14]:

Code

class TinyCausalLM(nn.Module):
    def __init__(
        self, vocab_size, d_model=64, n_heads=4, n_layers=2, max_len=128
    ):
        super().__init__()
        self.d_model = d_model

        # Token and position embeddings
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        # Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )

        # Output projection
        self.output = nn.Linear(d_model, vocab_size)

        # Register causal mask buffer
        self.register_buffer("causal_mask", None)

    def forward(self, x):
        batch_size, seq_len = x.shape

        # Create position indices
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)

        # Combine embeddings
        h = self.token_emb(x) + self.pos_emb(positions)

        # Create causal mask if needed
        if self.causal_mask is None or self.causal_mask.size(0) < seq_len:
            mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device), diagonal=1
            )
            self.causal_mask = mask.masked_fill(mask == 1, float("-inf"))

        # Apply transformer with causal masking
        h = self.transformer(
            h, mask=self.causal_mask[:seq_len, :seq_len], is_causal=True
        )

        # Project to vocabulary
        logits = self.output(h)

        return logits

class TinyCausalLM(nn.Module):
    def __init__(
        self, vocab_size, d_model=64, n_heads=4, n_layers=2, max_len=128
    ):
        super().__init__()
        self.d_model = d_model

        # Token and position embeddings
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        # Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )

        # Output projection
        self.output = nn.Linear(d_model, vocab_size)

        # Register causal mask buffer
        self.register_buffer("causal_mask", None)

    def forward(self, x):
        batch_size, seq_len = x.shape

        # Create position indices
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)

        # Combine embeddings
        h = self.token_emb(x) + self.pos_emb(positions)

        # Create causal mask if needed
        if self.causal_mask is None or self.causal_mask.size(0) < seq_len:
            mask = torch.triu(
                torch.ones(seq_len, seq_len, device=x.device), diagonal=1
            )
            self.causal_mask = mask.masked_fill(mask == 1, float("-inf"))

        # Apply transformer with causal masking
        h = self.transformer(
            h, mask=self.causal_mask[:seq_len, :seq_len], is_causal=True
        )

        # Project to vocabulary
        logits = self.output(h)

        return logits

Out[15]:

Console

Model parameters: 110,998
Model architecture: TinyCausalLM(
  (token_emb): Embedding(22, 64)
  (pos_emb): Embedding(128, 64)
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=256, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=256, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (output): Linear(in_features=64, out_features=22, bias=True)
)

With roughly 56,000 parameters, this is a tiny model by modern standards (GPT-3 has 175 billion). Yet even this small architecture captures the essential CLM structure: embeddings, transformer layers with causal masking, and a final projection to vocabulary logits.

Let's train this model for a few hundred steps:

In[16]:

Code

def get_batch(data, batch_size=32, seq_len=32):
    """Get a random batch of sequences."""
    # Random starting positions
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,))

    # Extract sequences
    x = torch.stack([data[s : s + seq_len] for s in starts])
    y = torch.stack([data[s + 1 : s + seq_len + 1] for s in starts])

    return x, y


# Training loop
model = TinyCausalLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for step in range(500):
    # Get batch
    x, y = get_batch(encoded, batch_size=16, seq_len=32)

    # Forward pass
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())

def get_batch(data, batch_size=32, seq_len=32):
    """Get a random batch of sequences."""
    # Random starting positions
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,))

    # Extract sequences
    x = torch.stack([data[s : s + seq_len] for s in starts])
    y = torch.stack([data[s + 1 : s + seq_len + 1] for s in starts])

    return x, y


# Training loop
model = TinyCausalLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for step in range(500):
    # Get batch
    x, y = get_batch(encoded, batch_size=16, seq_len=32)

    # Forward pass
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())

Out[17]:

Console

Initial loss: 3.1818 (random baseline: 3.0910)
Final loss: 0.1173
Final perplexity: 1.12

The loss dropped significantly from the random baseline. Starting near $\log(27) \approx 3.3$ (uniform distribution over 27 characters), the model converged to a much lower loss. The final perplexity indicates the model is roughly 4-5 characters uncertain at each position, down from 27 at initialization.

Let's visualize the training dynamics:

Out[18]:

Visualization

Line plot showing training loss decreasing from around 3.5 to 1.5 over 500 training steps. — Training loss curve for the character-level causal language model. The loss drops rapidly in early training as the model learns basic character statistics and common patterns. The convergence around 1.5 indicates the model has learned much of the structure but retains uncertainty about exact character choices.

Now let's generate text from the trained model using autoregressive sampling:

In[19]:

Code

def generate(model, start_text, max_new_tokens=100, temperature=0.8):
    """Generate text autoregressively."""
    model.eval()

    # Encode the prompt
    tokens = [char_to_idx[c] for c in start_text]
    tokens = torch.tensor(tokens).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Get predictions for last position
            logits = model(tokens)
            next_logits = logits[0, -1, :] / temperature

            # Sample from distribution
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)

    # Decode
    generated = "".join([idx_to_char[t.item()] for t in tokens[0]])
    return generated

def generate(model, start_text, max_new_tokens=100, temperature=0.8):
    """Generate text autoregressively."""
    model.eval()

    # Encode the prompt
    tokens = [char_to_idx[c] for c in start_text]
    tokens = torch.tensor(tokens).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Get predictions for last position
            logits = model(tokens)
            next_logits = logits[0, -1, :] / temperature

            # Sample from distribution
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)

    # Decode
    generated = "".join([idx_to_char[t.item()] for t in tokens[0]])
    return generated

The temperature parameter controls how "peaked" or "flat" the probability distribution becomes before sampling. Let's visualize this effect:

Out[20]:

Visualization

Three overlapping line plots showing probability distributions at different temperatures, with lower temperature creating a sharper peak. — Effect of temperature on the next-token probability distribution. Lower temperatures (T=0.3) sharpen the distribution, making the model more deterministic by concentrating probability on the most likely tokens. Higher temperatures (T=1.5) flatten the distribution, increasing randomness and diversity in generation. Temperature 1.0 uses the raw model probabilities.

Out[21]:

Console

Generated text samples:
--------------------------------------------------
Prompt: 'To be'
Output: To be or or not be that is the ques questinonoblestioble gestiobl

Prompt: 'The '
Output: The slings and arrows of outrageous fouthe fows f ofonge age 
Th

Prompt: 'Whether'
Output: Whether tis nobler in the mind to gsle to towsd sroblto sufetind mi

The generated text isn't perfect, but the model has learned character-level patterns from just 170 characters of Shakespeare. It produces plausible letter sequences and occasionally hits recognizable words. With more data and capacity, this same objective scales to GPT-4.

Training Data for CLMLink Copied

The quality and scale of training data fundamentally shapes what a causal language model learns. Modern LLMs are trained on datasets containing trillions of tokens, carefully curated from diverse sources.

Common training data sources include:

Web crawls: Common Crawl, C4, and similar filtered web scrapes provide broad coverage of internet text. Heavy filtering removes spam, duplicates, and low-quality content.
Books: Project Gutenberg, Books3, and licensed book corpora provide long-form, well-edited text that teaches narrative structure and coherent reasoning.
Code: GitHub, Stack Overflow, and code documentation help models understand programming languages and technical reasoning.
Scientific literature: Papers from arXiv, PubMed, and semantic scholar provide technical depth and formal reasoning.
Curated datasets: Wikipedia, news articles, and human-written examples balance quality with scale.

Data quality matters enormously. A model trained on Reddit comments writes like Reddit. A model trained on textbooks writes like textbooks. The mixture of sources directly influences the model's capabilities, style, and failure modes.

Deduplication is critical: repeated text causes models to memorize rather than generalize. Modern pipelines use MinHash, exact substring matching, or embedding-based deduplication to remove near-duplicate documents.

Scaling PropertiesLink Copied

Perhaps the most remarkable property of CLM is how predictably it scales. As we increase model size, dataset size, and compute, performance improves following consistent power laws.

The scaling laws discovered by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) empirically characterize how test loss depends on model size and training data. The key finding is that loss follows a power-law relationship with both factors:

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty

where:

$L(N, D)$ : the cross-entropy loss on held-out test data, as a function of model size and data
$N$ : the number of trainable model parameters (excluding embeddings)
$D$ : the number of training tokens the model has seen
$\alpha_N \approx 0.076$ : the scaling exponent for model size (how quickly loss improves as $N$ grows)
$\alpha_D \approx 0.095$ : the scaling exponent for data (how quickly loss improves as $D$ grows)
$N_c$ and $D_c$ : fitted constants that set the scale (roughly $8.8 \times 10^{13}$ and $5.4 \times 10^{13}$ respectively)
$L_\infty$ : the irreducible loss, representing fundamental uncertainty in language that no model can eliminate

The formula has three additive terms. The first term $\left(\frac{N_c}{N}\right)^{\alpha_N}$ captures model capacity limitations: smaller models have higher loss. The second term $\left(\frac{D_c}{D}\right)^{\alpha_D}$ captures data limitations: less training data means higher loss. The third term $L_\infty$ is the floor, around 1.69 nats, representing the inherent unpredictability of natural language.

This equation reveals that loss decreases as a power law with both model size and data. There's no plateau in sight: 10x more compute yields roughly 0.1 lower loss, consistently across many orders of magnitude.

Out[22]:

Visualization

Log-log plot showing power law decrease in loss as compute budget increases from 10^18 to 10^24 FLOPs. — Theoretical scaling behavior of language model loss with compute budget. The y-axis shows cross-entropy loss (lower is better), while the x-axis shows compute in FLOPs on a log scale. Following Chinchilla scaling, larger compute budgets should be split roughly equally between model size and training tokens.

The practical implication is clear: if you want a better language model, train a bigger model on more data with more compute. This insight has driven the race to scale, from GPT-2's 1.5 billion parameters to models with hundreds of billions.

Crucially, scaling also improves emergent capabilities. Models below a certain size cannot perform multi-step reasoning, follow complex instructions, or write working code. Above threshold scales, these abilities appear suddenly, a phenomenon called emergence.

Perplexity: The Standard MetricLink Copied

Perplexity is the standard evaluation metric for language models. While cross-entropy loss is useful for training, perplexity provides a more interpretable measure of model quality. It answers the question: on average, how many tokens is the model choosing between at each step?

Perplexity is defined as the exponential of the average negative log-likelihood:

\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i | x_{<i})\right)

where:

$\text{PPL}$ : perplexity, the evaluation metric (lower is better)
$N$ : the total number of tokens in the evaluation dataset
$P(x_i | x_{<i})$ : the probability the model assigns to token $x_i$ given its context
$-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i | x_{<i})$ : the average cross-entropy loss per token

The exponential converts the average log-probability back to a probability-like scale. If the model achieves an average loss of 2.3 nats per token, the perplexity is $e^{2.3} \approx 10$ .

The key intuition is this: a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely options at each step. A perplexity of 100 would mean 100-way uncertainty. State-of-the-art models achieve perplexities below 10 on standard benchmarks like WikiText-103, meaning they often predict the correct next word with high confidence.

Out[23]:

Visualization

Bar chart showing peaked distribution with perplexity 3. — Perplexity of 3: Model is very confident, concentrating probability on few tokens.

Bar chart showing moderate distribution with perplexity 10. — Perplexity of 10: Moderate uncertainty, spreading probability across more options.

Bar chart showing flat distribution with perplexity 100. — Perplexity of 100: High uncertainty, nearly uniform distribution.

In[24]:

Code

def compute_perplexity(model, data, seq_len=64):
    """Compute perplexity on a dataset."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for start in range(0, len(data) - seq_len - 1, seq_len):
            x = data[start : start + seq_len].unsqueeze(0)
            y = data[start + 1 : start + seq_len + 1].unsqueeze(0)

            logits = model(x)
            loss = F.cross_entropy(
                logits.view(-1, vocab_size), y.view(-1), reduction="sum"
            )

            total_loss += loss.item()
            total_tokens += seq_len

    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    return perplexity, avg_loss

def compute_perplexity(model, data, seq_len=64):
    """Compute perplexity on a dataset."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for start in range(0, len(data) - seq_len - 1, seq_len):
            x = data[start : start + seq_len].unsqueeze(0)
            y = data[start + 1 : start + seq_len + 1].unsqueeze(0)

            logits = model(x)
            loss = F.cross_entropy(
                logits.view(-1, vocab_size), y.view(-1), reduction="sum"
            )

            total_loss += loss.item()
            total_tokens += seq_len

    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    return perplexity, avg_loss

Out[25]:

Console

Perplexity on training data: 3.92
Average loss: 1.3663
Interpretation: Model is ~4-way uncertain on average

This perplexity on training data shows how well the model has fit the corpus. Since we're evaluating on the same text we trained on, this is an optimistic estimate. Held-out test perplexity would be higher, reflecting true generalization ability.

Limitations and ImpactLink Copied

Causal language modeling has revolutionized NLP, but it comes with important limitations that shape how we use these models in practice.

The unidirectional constraint means CLM models cannot naturally incorporate future context. For tasks like filling in the middle of a sentence or bidirectional understanding, this is a fundamental limitation. Models like BERT use masked language modeling to capture bidirectional dependencies, trading generation capability for richer representations. In practice, many applications now use CLM models with careful prompting to work around this constraint.

Training on next-token prediction creates models that are excellent at mimicking patterns in training data but may struggle with factual accuracy. A model can fluently generate text about events that never happened or facts that are simply wrong. The objective optimizes for plausibility, not truth. This has led to significant research in retrieval augmentation and grounding techniques that anchor model outputs in verified information.

The compute requirements are staggering. Training frontier models costs tens of millions of dollars and consumes megawatt-hours of electricity. This concentrates capability in a few well-resourced organizations and raises sustainability concerns. Techniques like distillation, quantization, and efficient architectures aim to democratize access, but the gap between frontier and accessible models remains wide.

Despite these limitations, CLM has unlocked capabilities that seemed impossible a decade ago. Modern LLMs can write code, translate languages, answer questions, and engage in open-ended conversation. They serve as foundations for instruction-following, reasoning, and tool-using agents. The simplicity of the objective belies the complexity of what emerges from optimizing it at scale.

Key ParametersLink Copied

When training causal language models, several parameters significantly impact performance:

d_model: The hidden dimension of the transformer. Larger values (512, 768, 1024) increase capacity but require more compute. Our tiny model used 64.
n_heads: Number of attention heads. Should divide d_model evenly. More heads allow the model to attend to different aspects of context simultaneously.
n_layers: Depth of the transformer stack. Deeper models can learn more complex patterns but are slower to train. Production models use 12-96 layers.
learning_rate: Typically 1e-4 to 6e-4 for transformers. Higher rates speed training but risk instability. Warmup schedules help stabilize early training.
batch_size: Larger batches provide more stable gradients but require more memory. Modern LLMs use effective batch sizes in the millions of tokens.
seq_len (context length): Maximum sequence length the model can process. Longer contexts enable better understanding but increase memory quadratically with attention.
temperature: Controls randomness during generation. Values near 0 produce deterministic, repetitive output. Values near 1 produce diverse, sometimes incoherent text. Typical range: 0.7-1.0.

SummaryLink Copied

Causal language modeling trains models to predict the next token given all previous tokens. This chapter covered the key concepts:

Autoregressive factorization decomposes sequence probability into a product of conditional probabilities, each predicting one token from its left context
The CLM objective minimizes cross-entropy loss between predicted and actual next tokens, providing dense gradient signals from every position
Causal masking in attention layers enforces the left-to-right information flow, preventing the model from seeing future tokens during training
Training data quality and scale directly determine model capabilities, with modern LLMs consuming trillions of curated tokens
Scaling laws show predictable improvements as compute, data, and parameters increase, following power-law relationships
Perplexity measures model quality as the exponential of average loss, with lower values indicating better predictions

The next chapter explores masked language modeling, a bidirectional alternative that trades generation capability for richer contextual representations.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about causal language modeling and next-token prediction.

Loading component...

Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →

Comments

Back to Language AI Handbook

Previous Chapter

Recurrent Memory

Next Chapter

Masked Language Modeling

Reference

BIBTEXAcademic

@misc{causallanguagemodelingthefoundationofgenerativeai, author = {Michael Brenndoerfer}, title = {Causal Language Modeling: The Foundation of Generative AI}, year = {2025}, url = {https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-19} }

APAAcademic

Michael Brenndoerfer (2025). Causal Language Modeling: The Foundation of Generative AI. Retrieved from https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai

MLAAcademic

Michael Brenndoerfer. "Causal Language Modeling: The Foundation of Generative AI." 2025. Web. 12/19/2025. <https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai>.

CHICAGOAcademic

Michael Brenndoerfer. "Causal Language Modeling: The Foundation of Generative AI." Accessed 12/19/2025. https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai.

HARVARDAcademic

Michael Brenndoerfer (2025) 'Causal Language Modeling: The Foundation of Generative AI'. Available at: https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai (Accessed: 12/19/2025).

SimpleBasic

Michael Brenndoerfer (2025). Causal Language Modeling: The Foundation of Generative AI. https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai

Direct link:

https://mbrenndoerfer.com/writing/causal-language-modeling-foundation-generative-ai

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

Causal Language Modeling: The Foundation of Generative AI

Causal Language ModelingLink Copied

The Autoregressive FactorizationLink Copied

The Combinatorial ChallengeLink Copied

The Chain Rule of ProbabilityLink Copied

The Compact NotationLink Copied

Why This Factorization WorksLink Copied

The CLM ObjectiveLink Copied

From Maximum Likelihood to Minimum LossLink Copied

The Connection to Cross-EntropyLink Copied

Understanding the Loss SignalLink Copied

Efficient Learning from Every TokenLink Copied

ImplementationLink Copied

From Sequence to Training ExamplesLink Copied

Causal MaskingLink Copied

A Working Example: Training a Tiny CLMLink Copied

Training Data for CLMLink Copied

Scaling PropertiesLink Copied

Perplexity: The Standard MetricLink Copied

Limitations and ImpactLink Copied

Key ParametersLink Copied

SummaryLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

Prefix Language Modeling: Combining Bidirectional Context with Causal Generation

Denoising Objectives: BART's Corruption Strategies for Language Models

Replaced Token Detection: ELECTRA's Efficient Pretraining Objective

Stay updated