GPT-2: Scaling Language Models for Zero-Shot Learning

Michael Brenndoerfer · Updated July 24, 2025 · 36 min read

Explore GPT-2's architecture, model sizes, WebText training, and zero-shot capabilities that transformed language modeling through scale.

GPT-2

In February 2019, OpenAI announced GPT-2 with an unusual strategy: they would not release the full model. The 1.5 billion parameter language model was, they claimed, too dangerous. It could generate text so convincingly human-like that they feared widespread misuse for fake news, spam, and impersonation. The decision sparked fierce debate about responsible AI release, but it also underscored a more fundamental shift. GPT-2 demonstrated that scale, applied to a simple autoregressive language modeling objective, could produce emergent capabilities that no one had explicitly trained for.

This chapter examines GPT-2's contributions to language modeling. We'll explore how OpenAI scaled up from GPT-1's 117 million parameters across four model sizes, understand the architectural refinements that enabled stable training at scale, analyze the WebText dataset that taught the model from high-quality internet text, and investigate the zero-shot learning phenomenon that made GPT-2 far more than just a text generator. By the end, you'll understand why GPT-2 marked the transition from fine-tuned specialists to general-purpose language models.

Model Sizes: A Family of Models

GPT-2 wasn't a single model but a family of four. OpenAI trained models ranging from 117 million to 1.5 billion parameters, enabling systematic study of how capabilities scale with size. This approach would become standard practice for subsequent large language models.

GPT-2 Model Variants

GPT-2 Small matches GPT-1's size at 117M parameters with 12 layers. GPT-2 Medium roughly triples this to 345M parameters with 24 layers. GPT-2 Large reaches 762M with 36 layers. GPT-2 XL, the full model, contains 1.5B parameters across 48 layers.

The architectural specifications for each variant are:

GPT-2 model specifications. Head dimension remains constant at 64 across all variants, with larger models achieving greater capacity through more heads and layers.
| Parameter | GPT-2 Small | GPT-2 Medium | GPT-2 Large | GPT-2 XL |
| --- | --- | --- | --- | --- |
| Layers ($L$) | 12 | 24 | 36 | 48 |
| Hidden size ($d_{model}$) | 768 | 1024 | 1280 | 1600 |
| Attention heads ($A$) | 12 | 16 | 20 | 25 |
| Head dimension ($d_{model}/A$) | 64 | 64 | 64 | 64 |
| Feed-forward size | 3072 | 4096 | 5120 | 6400 |
| Vocabulary size | 50,257 | 50,257 | 50,257 | 50,257 |
| Context length | 1024 | 1024 | 1024 | 1024 |
| Parameters | ~117M | ~345M | ~762M | ~1.5B |

The key architectural parameters are:

  • $L$: number of transformer layers (blocks stacked sequentially)
  • $d_{model}$: hidden dimension, the size of token representations flowing through the network
  • $A$: number of attention heads operating in parallel within each layer
  • $d_{model}/A$: dimension per attention head, determining how much information each head can capture

The head dimension stays fixed at 64 across all variants, following a pattern that would persist through GPT-3 and beyond. Scaling happens through more attention heads and more layers, not larger heads. The 4x feed-forward expansion ratio (hidden size $\times$ 4 = feed-forward size) also remains constant. Context length doubled from GPT-1's 512 to 1024 tokens, allowing the model to condition on longer passages.
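
As a quick check, we can verify these relationships directly from the table above. This is a minimal sketch using only the published hidden sizes and head counts:

# Verify that hidden_size / num_heads = 64 and the 4x FFN ratio for every variant
variants = {
    "GPT-2 Small": (768, 12),
    "GPT-2 Medium": (1024, 16),
    "GPT-2 Large": (1280, 20),
    "GPT-2 XL": (1600, 25),
}

for name, (hidden_size, num_heads) in variants.items():
    head_dim = hidden_size // num_heads
    ff_size = hidden_size * 4  # constant 4x expansion ratio
    print(f"{name}: head_dim = {head_dim}, feed-forward size = {ff_size}")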

Let's compute the parameter distribution to understand where model capacity resides:

In[3]:
Code
def count_gpt2_parameters(
    vocab_size: int = 50257,
    hidden_size: int = 768,
    num_layers: int = 12,
    max_position: int = 1024,
) -> dict:
    """Count parameters in each component of GPT-2."""
    intermediate_size = hidden_size * 4
    params = {}

    # Embedding layers (token and position)
    params["token_embeddings"] = vocab_size * hidden_size
    params["position_embeddings"] = max_position * hidden_size

    # Per-layer parameters
    # Self-attention: Q, K, V projections + output projection (all combined in c_attn, c_proj)
    attention_params = (
        3 * (hidden_size * hidden_size) + 3 * hidden_size
    )  # c_attn
    attention_params += hidden_size * hidden_size + hidden_size  # c_proj
    # Feed-forward: two linear layers (c_fc, c_proj)
    ff_params = hidden_size * intermediate_size + intermediate_size  # c_fc
    ff_params += intermediate_size * hidden_size + hidden_size  # c_proj
    # Layer norms (2 per layer)
    layernorm_params = 4 * hidden_size

    params["per_layer"] = attention_params + ff_params + layernorm_params
    params["all_layers"] = num_layers * params["per_layer"]

    # Final layer norm
    params["final_layernorm"] = 2 * hidden_size

    # Output projection (tied with token embeddings, so not counted separately)
    params["output_projection"] = 0  # Weight tying

    # Total
    params["embeddings_total"] = (
        params["token_embeddings"] + params["position_embeddings"]
    )
    params["total"] = (
        params["embeddings_total"]
        + params["all_layers"]
        + params["final_layernorm"]
    )

    return params
Out[4]:
Console
GPT-2 Small:
  Embeddings: 39,383,808 (31.6%)
  Transformer Layers: 85,054,464 (68.3%)
  Total: 124,439,808

GPT-2 Medium:
  Embeddings: 52,511,744 (14.8%)
  Transformer Layers: 302,309,376 (85.2%)
  Total: 354,823,168

GPT-2 Large:
  Embeddings: 65,639,680 (8.5%)
  Transformer Layers: 708,387,840 (91.5%)
  Total: 774,030,080

GPT-2 XL:
  Embeddings: 82,049,600 (5.3%)
  Transformer Layers: 1,475,558,400 (94.7%)
  Total: 1,557,611,200

The parameter breakdown reveals where model capacity resides. For GPT-2 Small, embeddings constitute about 31% of parameters, but this fraction shrinks dramatically as models scale. GPT-2 XL dedicates over 90% of its parameters to transformer layers, meaning the vast majority of learned knowledge resides in attention and feed-forward weights rather than the vocabulary embeddings.

Out[5]:
Visualization
Bar chart showing parameter counts for four GPT-2 model sizes from 117M to 1.5B.
Parameter distribution across GPT-2 model sizes. As models grow, transformer layers dominate increasingly while embedding overhead becomes proportionally smaller.

The embedding layers represent a decreasing fraction of total parameters as models grow. In GPT-2 Small, embeddings account for roughly 31% of parameters, but in GPT-2 XL this drops to about 5%. This pattern reflects a fundamental principle: vocabulary embeddings scale linearly with vocabulary size and hidden dimension, while transformer layers scale quadratically with hidden dimension and linearly with depth. For large models, knowledge increasingly resides in the transformer layers themselves.
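
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes the common approximation that each transformer block contributes about $12 \cdot d_{model}^2$ weights (ignoring biases and LayerNorms), while embeddings contribute $(V + P) \cdot d_{model}$, with $V$ the vocabulary size and $P$ the number of positions:

# Embeddings grow linearly with d_model; transformer layers grow roughly quadratically
vocab_size, max_position = 50257, 1024
configs = [("Small", 768, 12), ("Medium", 1024, 24), ("Large", 1280, 36), ("XL", 1600, 48)]

for name, d_model, n_layers in configs:
    embed_params = (vocab_size + max_position) * d_model
    layer_params = 12 * d_model**2 * n_layers  # approximation: ~12 * d_model^2 per block
    share = embed_params / (embed_params + layer_params)
    print(f"GPT-2 {name}: embeddings ~{embed_params / 1e6:.1f}M, "
          f"layers ~{layer_params / 1e6:.1f}M, embedding share ~{share:.1%}")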

Architectural Changes from GPT-1

GPT-2 maintained the decoder-only transformer architecture from GPT-1 but introduced several modifications that proved crucial for stable training at larger scales. These changes became standard practice for subsequent language models.

Pre-Normalization

The most significant architectural change was moving layer normalization before each sub-block rather than after. GPT-1 and the original transformer used post-normalization, applying LayerNorm after the residual connection. GPT-2 switched to pre-normalization, applying LayerNorm before the attention and feed-forward operations.

In[6]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPT2Block(nn.Module):
    """GPT-2 transformer block with pre-normalization."""

    def __init__(
        self, hidden_size: int = 768, num_heads: int = 12, dropout: float = 0.1
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_heads

        # Pre-normalization: LayerNorm BEFORE attention and FFN
        self.ln_1 = nn.LayerNorm(hidden_size, eps=1e-5)
        self.ln_2 = nn.LayerNorm(hidden_size, eps=1e-5)

        # Attention (simplified for clarity)
        self.c_attn = nn.Linear(
            hidden_size, 3 * hidden_size
        )  # Q, K, V combined
        self.c_proj = nn.Linear(hidden_size, hidden_size)

        # Feed-forward
        self.c_fc = nn.Linear(hidden_size, 4 * hidden_size)
        self.c_proj_ffn = nn.Linear(4 * hidden_size, hidden_size)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self, x: torch.Tensor, mask: torch.Tensor = None
    ) -> torch.Tensor:
        # Pre-norm attention
        residual = x
        x = self.ln_1(x)  # Normalize BEFORE attention
        x = self._attention(x, mask)
        x = residual + self.dropout(x)  # Residual connection

        # Pre-norm feed-forward
        residual = x
        x = self.ln_2(x)  # Normalize BEFORE FFN
        x = self.c_fc(x)
        x = F.gelu(x)  # GELU activation
        x = self.c_proj_ffn(x)
        x = residual + self.dropout(x)  # Residual connection

        return x

    def _attention(
        self, x: torch.Tensor, mask: torch.Tensor = None
    ) -> torch.Tensor:
        B, T, C = x.size()
        head_dim = C // self.num_heads

        # Compute Q, K, V
        qkv = self.c_attn(x)
        q, k, v = qkv.split(C, dim=-1)

        # Reshape for multi-head attention
        q = q.view(B, T, self.num_heads, head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, head_dim).transpose(1, 2)

        # Scaled dot-product attention with causal mask
        # Scale by 1/sqrt(d_k) to prevent dot products from growing too large
        # which would push softmax into regions with tiny gradients
        scale = 1.0 / (head_dim**0.5)
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Apply causal mask
        causal_mask = torch.triu(
            torch.ones(T, T, device=x.device), diagonal=1
        ).bool()
        attn = attn.masked_fill(causal_mask, float("-inf"))

        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)

        # Apply attention to values
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, T, C)

        return self.c_proj(out)

The causal mask is crucial for autoregressive generation. It ensures each position can only attend to previous positions (and itself), preventing information leakage from future tokens. Let's visualize what this mask looks like:

Out[7]:
Visualization
Heatmap showing triangular causal attention pattern where each token attends only to previous tokens.
Causal attention mask for an 8-token sequence. Each row shows which positions a token can attend to. Position 0 (first token) can only see itself, while position 7 (last token) can attend to all previous positions.

The lower triangular pattern means token $t_i$ can only attend to tokens $t_0, t_1, \ldots, t_i$. This is what makes GPT-2 autoregressive: when predicting the next token, the model cannot "peek" at future tokens. During training, this enables efficient parallel computation since the entire sequence can be processed in one forward pass with masking.

Why does this matter? Pre-normalization provides more stable gradients during training. In post-normalization, the residual branch adds unnormalized activations, which can have high variance. The subsequent LayerNorm must handle this variance, but gradients flowing back through the normalization can become unstable at scale. With pre-normalization, the residual connection adds already-normalized values, keeping the residual stream's magnitude more controlled.
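
The contrast is easiest to see as the ordering of operations inside a block. The following schematic sketch assumes `attn` and `ffn` are the sub-layer functions and `ln1`, `ln2` are their LayerNorms; it is not GPT-2's actual code, just the two orderings side by side:

# Post-normalization (GPT-1 / original transformer): normalize AFTER the residual sum
def post_norm_block(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))  # residual adds unnormalized activations, then LayerNorm
    x = ln2(x + ffn(x))
    return x


# Pre-normalization (GPT-2): normalize the INPUT to each sub-layer
def pre_norm_block(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))  # residual stream remains an untouched sum of sub-layer outputs
    x = x + ffn(ln2(x))
    return x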

Out[8]:
Visualization
Diagram showing post-normalization transformer block architecture.
Post-normalization (GPT-1): LayerNorm applied after residual connections.
Diagram showing pre-normalization transformer block architecture.
Pre-normalization (GPT-2): LayerNorm applied before attention and FFN blocks.

GELU Activation

GPT-2 replaced the ReLU activation function with GELU (Gaussian Error Linear Unit). While ReLU simply zeros out negative values, GELU applies a smooth, non-monotonic transformation that weights inputs by their probability under a Gaussian distribution.

The GELU function multiplies each input by the probability that a standard Gaussian random variable falls below it. This creates a smooth gating mechanism where positive values pass through almost unchanged, while negative values are attenuated based on how negative they are.

The exact formulation is:

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where:

  • $x$: the input value to the activation function
  • $\Phi(x)$: the cumulative distribution function (CDF) of the standard normal distribution, representing the probability that a random variable $X \sim \mathcal{N}(0, 1)$ is less than or equal to $x$
  • $\text{erf}(\cdot)$: the error function, a special function that arises in probability and statistics

The intuition is elegant: we multiply each input $x$ by the probability that $x$ exceeds a value drawn from a standard Gaussian. Large positive values have $\Phi(x) \approx 1$, so they pass through unchanged. Large negative values have $\Phi(x) \approx 0$, so they're zeroed out. Values near zero get partially attenuated.

In practice, computing the error function is expensive. GPT-2 and most implementations use this polynomial approximation:

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)$$

where $\tanh$ is the hyperbolic tangent function and the constants $\sqrt{2/\pi} \approx 0.7979$ and $0.044715$ were chosen to minimize approximation error. This formulation is differentiable everywhere and computationally efficient.
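
To see how close the tanh approximation is, here is a small sketch comparing the exact GELU (via the error function) with GPT-2's approximation at a few sample inputs:

import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), computed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based polynomial approximation used by GPT-2."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


for x in [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    print(f"x = {x:5.1f}   exact = {gelu_exact(x):8.5f}   tanh approx = {gelu_tanh(x):8.5f}")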

Out[9]:
Visualization
Line plot comparing ReLU and GELU activations showing GELU's smooth curve versus ReLU's sharp corner.
Comparison of ReLU and GELU activation functions. GELU provides a smooth transition rather than a hard cutoff at zero, allowing small gradient flow for slightly negative inputs.

GELU's smoothness provides better gradient flow compared to ReLU's hard cutoff. To understand why this matters for training, let's compare the derivatives:

Out[10]:
Visualization
Line plot comparing derivatives of ReLU and GELU showing GELU's smooth gradient transition.
Derivatives of ReLU and GELU activation functions. ReLU has a discontinuous derivative at zero (0 for negative, 1 for positive), while GELU transitions smoothly, allowing gradients to flow through slightly negative values.

The shaded region shows where GELU outperforms ReLU: for slightly negative inputs, ReLU has zero gradient (the "dying ReLU" problem), while GELU allows some gradient to flow. This small difference compounds across billions of training steps and millions of neurons, potentially making the difference between successful and failed optimization at large scales.

Other Modifications

Several additional changes improved GPT-2's training stability:

  • Increased vocabulary size: GPT-2 uses a vocabulary of 50,257 tokens (versus GPT-1's ~40,000), built using byte-level Byte Pair Encoding. This allows the model to represent any text without unknown tokens.

  • Extended context length: The context window doubled from 512 to 1024 tokens, enabling the model to condition on longer passages.

  • Modified initialization: Weights in residual layers were scaled by $1/\sqrt{N}$, where $N$ is the number of residual layers (each transformer block has 2 residual connections, so for a 12-layer model, $N = 24$). Without this scaling, each residual addition increases the variance of activations, and after $N$ additions the variance would grow proportionally to $N$. Scaling by $1/\sqrt{N}$ keeps the variance of the residual stream approximately constant regardless of depth (a short simulation follows this list).

  • Final layer normalization: An additional LayerNorm is applied after the last transformer block, before the output projection.
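
The effect of the $1/\sqrt{N}$ scaling can be simulated directly. The sketch below adds $N = 24$ random residual contributions to a stream, with and without the scaling, and prints the resulting standard deviation. It is an illustrative simulation, not GPT-2's training code:

import torch

torch.manual_seed(0)
hidden_size, num_residuals = 768, 24  # 12 layers x 2 residual connections each


def simulate_residual_stream(scale: float) -> float:
    """Add num_residuals random contributions and return the final standard deviation."""
    x = torch.zeros(hidden_size)
    for _ in range(num_residuals):
        x = x + scale * torch.randn(hidden_size)  # stand-in for one sub-layer output
    return x.std().item()

print("Without scaling:      std ~", round(simulate_residual_stream(1.0), 2))
print("With 1/sqrt(N) scale: std ~", round(simulate_residual_stream(1.0 / 24**0.5), 2))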

Let's implement a minimal GPT-2 model to see these components together:

In[11]:
Code
import numpy as np


class GPT2(nn.Module):
    """Minimal GPT-2 implementation."""

    def __init__(
        self,
        vocab_size: int = 50257,
        hidden_size: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
        max_position: int = 1024,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Token and position embeddings
        self.wte = nn.Embedding(vocab_size, hidden_size)
        self.wpe = nn.Embedding(max_position, hidden_size)
        self.drop = nn.Dropout(dropout)

        # Transformer blocks
        self.blocks = nn.ModuleList(
            [
                GPT2Block(hidden_size, num_heads, dropout)
                for _ in range(num_layers)
            ]
        )

        # Final layer norm (pre-norm style means we need this at the end)
        self.ln_f = nn.LayerNorm(hidden_size, eps=1e-5)

        # Output projection (tied with token embeddings)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # Weight tying

        # Initialize weights
        self.apply(self._init_weights)
        # Scale residual projections
        for block in self.blocks:
            block.c_proj.weight.data *= 1.0 / np.sqrt(2 * num_layers)
            block.c_proj_ffn.weight.data *= 1.0 / np.sqrt(2 * num_layers)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        B, T = input_ids.size()
        device = input_ids.device

        # Get embeddings
        position_ids = torch.arange(T, device=device).unsqueeze(0)
        token_emb = self.wte(input_ids)
        pos_emb = self.wpe(position_ids)
        x = self.drop(token_emb + pos_emb)

        # Apply transformer blocks
        for block in self.blocks:
            x = block(x)

        # Final layer norm and output projection
        x = self.ln_f(x)
        logits = self.lm_head(x)

        return logits
Out[12]:
Console
GPT-2 Small: 124,439,808 parameters
GPT-2 Medium: 354,823,168 parameters
GPT-2 Large: 774,030,080 parameters
GPT-2 XL: 1,557,611,200 parameters

Our implementation matches the parameter counts of the released GPT-2 checkpoints. The commonly cited 117M and 345M figures come from the original paper, but the checkpoints themselves contain roughly 124M and 355M parameters. The overall scaling holds either way: roughly a 2.9x increase from Small to Medium, 2.2x from Medium to Large, and 2x from Large to XL. Weight tying between the input embeddings and output projection reduces the parameter count while maintaining performance, a technique that became standard in language models.
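
A rough sketch of what that tying saves: without it, the output projection would need its own vocab_size × hidden_size matrix on top of the token embeddings.

# Parameters saved by tying the output projection to the token embedding matrix
vocab_size = 50257
for name, hidden_size in [("Small", 768), ("Medium", 1024), ("Large", 1280), ("XL", 1600)]:
    saved = vocab_size * hidden_size
    print(f"GPT-2 {name}: weight tying saves ~{saved / 1e6:.1f}M parameters")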

WebText: Learning from the Internet

GPT-2's training data represented a significant departure from previous approaches. Rather than using carefully curated datasets like BooksCorpus (used for GPT-1) or Wikipedia, OpenAI created WebText by scraping outbound links from Reddit posts with at least 3 upvotes.

WebText Dataset

WebText contains approximately 40 GB of text from 8 million web pages. The filtering heuristic of requiring Reddit upvotes served as a proxy for content quality, as humans had implicitly judged these links worth sharing.

The WebText construction process involved several key decisions:

  • Source selection: Links from Reddit with $\geq 3$ karma (upvotes minus downvotes) were collected. This crowdsourced filtering biased toward content that humans found interesting, informative, or entertaining.

  • Deduplication: Near-duplicate documents were removed to prevent the model from memorizing repeated content.

  • Minimal preprocessing: Unlike many previous datasets, WebText preserved formatting, punctuation, and case. The model learned from text as it appears naturally on the web.

  • Exclusion of Wikipedia: Wikipedia was deliberately excluded to enable fair evaluation on Wikipedia-based benchmarks.

This data collection strategy reflected a key insight: the internet contains enormous diversity of text genres, styles, and topics. By learning from this diversity, GPT-2 could potentially generalize to many tasks without task-specific training.

Out[13]:
Visualization
Pie chart showing approximate content type distribution in WebText dataset.
Estimated composition of WebText by content type. The dataset spans news, blogs, forums, educational content, and creative writing, providing diverse training signal.

The WebText approach had important implications. First, it scaled easily since the internet provides virtually unlimited text. Second, the diversity exposed the model to many natural "tasks" embedded in web text: answering questions, summarizing articles, translating between languages, and explaining concepts. Third, it introduced biases present in Reddit's user base and the content they share.

Zero-Shot Task Performance

GPT-2's most surprising contribution was demonstrating zero-shot task performance. Without any task-specific fine-tuning, GPT-2 could perform reading comprehension, summarization, translation, and question answering simply by framing these tasks as language modeling.

The Zero-Shot Paradigm

The key insight was that many NLP tasks can be expressed as conditional text generation. Rather than training separate models for each task, you can prompt a language model with a description of the task:

  • Question answering: "Q: What is the capital of France? A:"
  • Summarization: "Article: [article text] TL;DR:"
  • Translation: "English: Hello, how are you? French:"

The language model, having seen similar patterns in its training data, learns to complete these prompts appropriately.

In[14]:
Code
# Demonstrate zero-shot prompting patterns
zero_shot_examples = {
    "reading_comprehension": """
Passage: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars 
in Paris, France. It is named after the engineer Gustave Eiffel, whose company 
designed and built the tower.

Question: Who designed the Eiffel Tower?
Answer:""",
    "summarization": """
Article: Scientists have discovered a new species of deep-sea fish in the 
Mariana Trench. The fish, nicknamed the "ghost fish" due to its translucent 
appearance, was found at depths exceeding 8,000 meters. Researchers believe 
the fish has adapted unique biological mechanisms to survive the extreme 
pressure at such depths.

TL;DR:""",
    "translation": """
English: The weather is beautiful today.
French:""",
    "sentiment": """
Review: This movie was absolutely incredible! The acting was superb and the 
plot kept me on the edge of my seat the entire time.
Sentiment:""",
}
Out[15]:
Console
Zero-Shot Prompt Examples:
============================================================

READING COMPREHENSION:
----------------------------------------
Passage: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars 
in Paris, France. It is named after the engineer Gustave Eiffel, whose company 
designed and built the tower.

Question: Who designed the Eiffel Tower?
Answer:


SUMMARIZATION:
----------------------------------------
Article: Scientists have discovered a new species of deep-sea fish in the 
Mariana Trench. The fish, nicknamed the "ghost fish" due to its translucent 
appearance, was found at depths exceeding 8,000 meters. Researchers believe 
the fish has adapted unique biological mechanisms to survive the extreme 
pressure at such depths.

TL;DR:


TRANSLATION:
----------------------------------------
English: The weather is beautiful today.
French:


SENTIMENT:
----------------------------------------
Review: This movie was absolutely incredible! The acting was superb and the 
plot kept me on the edge of my seat the entire time.
Sentiment:

Each prompt demonstrates a different task expressed as conditional text generation. The reading comprehension prompt provides context then asks a question. The summarization prompt uses "TL;DR:" as a signal the model learned from internet forums. The translation prompt establishes a pattern the model should continue. These formats work because WebText contained millions of similar patterns.

Benchmark Results

OpenAI evaluated GPT-2 on various benchmarks in zero-shot mode, comparing against supervised baselines that were explicitly trained for each task.

Out[16]:
Visualization
Bar chart comparing perplexity scores of GPT-2 model sizes on multiple benchmarks.
GPT-2 zero-shot performance compared to supervised baselines on language modeling benchmarks. Larger models consistently perform better, with GPT-2 XL achieving state-of-the-art on several datasets.

The relationship between model size and perplexity follows a consistent pattern across benchmarks. Let's examine this scaling behavior more closely:

Out[17]:
Visualization
Log-log scatter plot showing perplexity decreasing as model parameters increase across four benchmarks.
Log-log plot of model parameters versus perplexity reveals approximate power-law scaling. Each doubling of parameters yields a consistent perplexity reduction, a pattern that would later be formalized in neural scaling laws.

On a log-log plot, the near-linear relationship between parameters and perplexity suggests power-law scaling. This empirical observation, that performance improves predictably with scale, would later be formalized in the "Scaling Laws for Neural Language Models" paper by Kaplan et al. (2020). The consistency across diverse benchmarks indicates that scaling benefits are not dataset-specific but reflect genuine improvements in language modeling capability.
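
To make "linear on a log-log plot" concrete, the sketch below generates synthetic perplexities from an assumed power law and recovers the exponent with a linear fit in log space. The numbers are illustrative only, not GPT-2's actual benchmark results:

import numpy as np

# Synthetic data: perplexity = c * params^(-alpha), plus a little noise (NOT real results)
rng = np.random.default_rng(0)
params = np.array([117e6, 345e6, 762e6, 1.5e9])
true_alpha, c = 0.08, 200.0
perplexity = c * params ** (-true_alpha) * np.exp(rng.normal(0.0, 0.01, size=4))

# A power law is a straight line in log-log space: log(ppl) = log(c) - alpha * log(params)
slope, intercept = np.polyfit(np.log(params), np.log(perplexity), deg=1)
print(f"Fitted exponent: {-slope:.3f} (value used to generate the data: {true_alpha})")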

Several patterns emerged from these evaluations:

  • Consistent scaling: Larger models achieved lower perplexity across all benchmarks. GPT-2 XL achieved state-of-the-art on Penn Treebank despite never being explicitly trained on it.

  • Domain transfer: Performance on WikiText (derived from Wikipedia) was strong despite Wikipedia being excluded from training, suggesting the model learned transferable language patterns.

  • Task emergence: Capabilities like reading comprehension and translation emerged without task-specific training, purely from scale and diverse pretraining.

Limitations of Zero-Shot

Zero-shot performance, while impressive, came with clear limitations:

  • Inconsistency: The model sometimes generated plausible-sounding but incorrect answers, particularly for factual questions.

  • Sensitivity to prompting: Small changes in prompt format could significantly affect performance.

  • Task ambiguity: Without examples, the model sometimes misinterpreted what was being asked.

These limitations motivated the subsequent exploration of few-shot learning, where providing a handful of examples dramatically improved task performance.

Generation Quality

GPT-2's text generation quality captured public attention more than its benchmark scores. The model could produce coherent, contextually appropriate text that often fooled readers into thinking it was human-written.

Coherent Long-Form Generation

Unlike previous language models that quickly devolved into nonsense, GPT-2 maintained coherence over paragraphs and pages. Let's examine what enables this.

In[18]:
Code
def top_k_sampling(
    logits: torch.Tensor, k: int = 50, temperature: float = 1.0
) -> int:
    """Sample from top-k logits with temperature scaling."""
    # Apply temperature
    logits = logits / temperature

    # Get top-k logits and indices
    top_k_logits, top_k_indices = torch.topk(logits, k)

    # Convert to probabilities
    probs = F.softmax(top_k_logits, dim=-1)

    # Sample from the distribution
    sampled_idx = torch.multinomial(probs, num_samples=1)

    return top_k_indices[sampled_idx].item()


def nucleus_sampling(
    logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0
) -> int:
    """Sample from nucleus (top-p) distribution."""
    # Apply temperature
    logits = logits / temperature

    # Sort logits and get cumulative probabilities
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens outside the nucleus, keeping the token that crosses the threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift right so the first token exceeding p stays in the nucleus
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    # Always keep at least the highest-probability token
    sorted_indices_to_remove[0] = False

    # Set removed logits to -inf
    sorted_logits[sorted_indices_to_remove] = float("-inf")

    # Convert to probabilities and sample
    probs = F.softmax(sorted_logits, dim=-1)
    sampled_idx = torch.multinomial(probs, num_samples=1)

    return sorted_indices[sampled_idx].item()

The quality of generated text depends heavily on the sampling strategy. Pure random sampling from the full distribution produces incoherent text because low-probability tokens occasionally get selected. Top-k and nucleus sampling restrict generation to high-probability tokens, dramatically improving coherence.
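
Here is a small usage sketch applying both functions to a toy logit vector. The logit values are hypothetical, and the torch import is assumed from the earlier cells:

torch.manual_seed(42)
toy_logits = torch.tensor([4.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])  # hypothetical next-token scores

# Top-k: sample from only the 3 highest-scoring tokens
print("top-k sample:", top_k_sampling(toy_logits, k=3, temperature=1.0))

# Nucleus: sample from the smallest set of tokens whose probability mass reaches 0.9
print("nucleus sample:", nucleus_sampling(toy_logits, p=0.9, temperature=1.0))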

Out[19]:
Visualization
Bar chart showing top-k sampling selecting the 5 highest probability tokens.
Top-k sampling keeps the k highest probability tokens.
Bar chart showing nucleus sampling selecting tokens until cumulative probability reaches 0.9.
Nucleus (top-p) sampling keeps tokens until cumulative probability exceeds p.

Temperature and Creativity

The temperature parameter provides a tunable tradeoff between coherence and creativity. Temperature scales the logits before applying softmax, controlling how "peaked" or "flat" the resulting probability distribution becomes.

The temperature-scaled softmax computes the probability of selecting token ii as:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$$

where:

  • $P(x_i)$: the probability of selecting token $i$
  • $z_i$: the raw logit (unnormalized score) for token $i$ from the model's output layer
  • $T$: the temperature parameter, a positive scalar
  • $V$: the vocabulary size (total number of possible tokens)
  • $\exp(\cdot)$: the exponential function

When $T = 1$, this reduces to standard softmax. As $T \to 0$, the distribution becomes increasingly peaked around the maximum logit (approaching argmax). As $T \to \infty$, the distribution approaches uniform, giving all tokens equal probability regardless of their logits. This happens because dividing by a large $T$ compresses all logit differences toward zero.
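
The compression effect is easy to verify numerically. The following sketch applies temperature-scaled softmax to a toy logit vector at three temperatures (the logit values are hypothetical):

import torch
import torch.nn.functional as F

toy_logits = torch.tensor([3.0, 2.0, 1.0, 0.0])  # hypothetical logits for four tokens

for T in [0.3, 1.0, 2.0]:
    probs = F.softmax(toy_logits / T, dim=-1)
    print(f"T = {T}: {[round(p, 3) for p in probs.tolist()]}")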

Out[20]:
Visualization
Bar chart showing how probability distributions change across different temperature values.
Effect of temperature on probability distribution. Lower temperatures concentrate probability on the top token, while higher temperatures approach uniform distribution.

At temperature 0.3, the model almost always selects the highest-probability token, producing repetitive but grammatically correct text. At temperature 1.0 (the default), it balances variety with coherence. At temperature 2.0, even low-probability tokens have significant chance of selection, leading to creative but potentially nonsensical output.

Sample Generation

Let's demonstrate generation using a pretrained GPT-2 model from Hugging Face:

In[21]:
Code
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pretrained GPT-2 (smallest model for efficiency)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def generate_text(
    prompt: str,
    max_new_tokens: int = 50,
    temperature: float = 0.8,
    top_p: float = 0.9,
) -> str:
    """Generate text continuation using GPT-2."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
Out[22]:
Console
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
GPT-2 Text Generation Examples
============================================================

Prompt: The future of artificial intelligence
----------------------------------------
Generated: may not be fully defined until it is done with humans, and AI should continue to evolve in a way that makes it more accessible to people.

Image via Shutterstock


Prompt: In a small village nestled between mountains,
----------------------------------------
Generated: I've been on a pilgrimage to the village of Shakhriya for more than a year, in the midst of which a family of refugees has been trying to make it to the city. I've been in contact with some of them, many of them are in their twenties or thirties


Prompt: Scientists recently discovered that
----------------------------------------
Generated: the brain of a man who was diagnosed with autism has a more complex neural pathway, making it easier to treat autism.

"In autism there is a large network of pathways involved," said study co-author, Dr. J. Peter Huygens, professor of neuroscience at the University

The generated text demonstrates GPT-2's ability to maintain topical coherence and grammatical correctness. It continues prompts in contextually appropriate ways, though close inspection often reveals subtle errors in logic or factual accuracy.

Limitations and Impact

GPT-2 introduced capabilities that transformed expectations for language models, but it also revealed fundamental challenges that persist in larger models today.

Factual Reliability

GPT-2 generates fluent text that can contain fabricated facts presented with confidence. The model has no mechanism to verify statements against ground truth; it simply produces statistically likely continuations. This creates a dangerous asymmetry: the text sounds authoritative but may be wrong. For knowledge-intensive tasks like medical advice or legal guidance, GPT-2's generations are unreliable without external verification. This limitation catalyzed research into retrieval-augmented generation and fact-checking systems.

Bias and Toxicity

The WebText training data, while filtered for quality through Reddit upvotes, inherited biases from its source. GPT-2 can generate text reflecting stereotypes, political biases, and offensive content present in its training distribution. OpenAI's decision to stage the model's release was partly motivated by concerns about automated generation of targeted harassment and misinformation. Subsequent work on RLHF (reinforcement learning from human feedback) and constitutional AI directly addresses these safety concerns.

What GPT-2 Enabled

Despite these limitations, GPT-2 fundamentally changed the field:

  • Scale as a strategy: GPT-2 demonstrated that simply making models larger could yield qualitatively new capabilities. This insight drove the push to GPT-3, PaLM, and other massive models.

  • Zero-shot learning: The idea that a single model could attempt many tasks without fine-tuning opened research into prompting, in-context learning, and instruction tuning.

  • Generation quality: GPT-2 set a new bar for text generation coherence, making language model outputs useful for drafting, brainstorming, and creative applications.

  • Open research: After staged release, OpenAI eventually published the full GPT-2 model. This enabled extensive research into interpretability, safety, and applications.

The transition from GPT-1 to GPT-2 established the scaling paradigm that would define the next several years of language model development: more parameters, more data, more compute. Each increment revealed new capabilities that smaller models lacked.

Summary

GPT-2 demonstrated that scaling autoregressive language models yields emergent capabilities beyond next-token prediction. The key contributions include:

  • Model family: Four sizes from 117M to 1.5B parameters, enabling systematic study of scaling behavior while keeping architecture constant

  • Architectural refinements: Pre-normalization for stable training, GELU activation for smoother gradients, and modified initialization to control variance across deep networks

  • WebText training: 40GB of text from Reddit-filtered web pages, providing diverse, naturally-occurring task demonstrations without explicit supervision

  • Zero-shot capabilities: Reading comprehension, summarization, and translation emerged from pure language modeling, achieved through clever prompting

  • Generation quality: Coherent multi-paragraph text generation that, with appropriate sampling strategies, could pass for human-written

GPT-2 marked the transition from fine-tuned specialists to general-purpose language models. Its limitations, particularly in factual reliability and bias, remain active research areas. But its core insight, that scale enables emergence, fundamentally reshaped expectations for what language models could become. The path to GPT-3 and beyond was now clear: continue scaling, and capabilities would follow.

Key Parameters

When working with GPT-2 or implementing similar architectures, these parameters most directly affect model behavior:

Architecture Parameters:

  • hidden_size (768 to 1600): The dimensionality of token representations. Larger values increase model capacity but quadratically increase attention computation. GPT-2 variants use 768, 1024, 1280, and 1600.

  • num_layers (12 to 48): The number of stacked transformer blocks. More layers enable deeper processing but increase memory and compute linearly. GPT-2's variants use 12, 24, 36, and 48 layers, all multiples of 12.

  • num_heads (12 to 25): The number of parallel attention heads. Must evenly divide hidden_size. More heads allow attending to different relationship types simultaneously.

  • max_position (1024): Maximum sequence length the model can process. Longer contexts enable conditioning on more information but increase memory quadratically with sequence length.

Generation Parameters:

  • temperature (0.0 to 2.0): Controls randomness in sampling. Values below 1.0 sharpen the distribution (more deterministic), values above 1.0 flatten it (more random). Common choices: 0.7 for focused text, 1.0 for balanced, 1.2+ for creative writing.

  • top_k (1 to 100): Restricts sampling to the k highest-probability tokens. Lower values increase coherence but reduce diversity. Common choices: 40-50 for general use, lower for factual tasks.

  • top_p (0.0 to 1.0): Nucleus sampling threshold. Keeps tokens until cumulative probability exceeds p. Adapts to the shape of each distribution. Common choices: 0.9-0.95 for balanced generation.

  • max_new_tokens: Number of tokens to generate. Longer generations risk coherence degradation. Consider the task: summaries need fewer tokens than stories.

Training Parameters:

  • dropout (0.0 to 0.3): Applied to attention weights and feed-forward outputs during training. GPT-2 uses 0.1. Higher values prevent overfitting on small datasets but slow convergence.

  • weight_decay (0.0 to 0.1): L2 regularization strength. GPT-2 uses 0.01. Prevents weights from growing too large, improving generalization.

