Qwen Architecture: Alibaba's Multilingual LLM Design

Michael Brenndoerfer · Updated August 7, 2025 · 49 min read

Deep dive into Qwen's architectural innovations including GQA, SwiGLU activation, and multilingual tokenization. Learn how Qwen optimizes for Chinese and English performance.


Qwen Architecture

Introduction

The Qwen (通义千问, Tongyi Qianwen) model family from Alibaba Cloud combines proven architectural innovations with design choices optimized for both Chinese and English performance. First released in August 2023, Qwen models showed that careful attention to tokenization, training data curation, and architectural refinements could produce models rivaling or exceeding Western counterparts, particularly for Asian language tasks.

This chapter examines the architectural decisions, training methodology, and design philosophy that distinguish Qwen from other LLaMA-derived models. We'll explore how Qwen adapts the decoder-only Transformer architecture with specific optimizations for multilingual capabilities, how its training approach balances efficiency with quality, and why these choices matter for applications requiring strong performance across multiple languages.

Chapter roadmap: We'll start by examining Qwen's architectural foundations and how they build upon LLaMA's innovations (Section 2). Then we'll explore the tokenization strategy critical to multilingual performance (Section 3). After understanding the architecture and tokenization, we'll look at training methodology and data composition (Section 4). We'll implement key components to solidify understanding (Section 5), examine the Qwen model family and variants (Section 6), and conclude by discussing limitations and impact (Section 7).

Architecture Overview

Before diving into specific components, let's understand how Qwen fits into the modern LLM landscape and what distinguishes it architecturally.

Foundation: Decoder-Only Transformer

Out[2]:
Visualization
Diagram showing Qwen's decoder-only transformer architecture with token inputs flowing through embedding layer, N stacked transformer blocks, and output projection.
High-level Qwen architecture showing the flow from token input through embedding, transformer blocks, and final output projection. Qwen shares the decoder-only foundation with LLaMA but introduces specific modifications.

Qwen adopts the decoder-only Transformer architecture, following the successful pattern established by GPT and refined by LLaMA. Like its predecessors, Qwen uses:

  • Causal (masked) self-attention, so each token attends only to earlier positions
  • Pre-norm residual blocks with RMSNorm for training stability
  • A position-wise feed-forward network in every layer
  • Rotary positional embeddings (RoPE) to encode token order
  • An autoregressive next-token prediction objective

Key Architectural Differences

Qwen combines proven innovations from LLaMA with additional refinements:

Architectural comparison between LLaMA and Qwen. Key differences include QKV bias, larger vocabulary, and universal GQA.
| Component | LLaMA | Qwen | Rationale |
| --- | --- | --- | --- |
| Normalization | RMSNorm (pre-norm) | RMSNorm (pre-norm) | Training stability |
| Activation | SwiGLU | SwiGLU | Model quality |
| Positional Encoding | RoPE | RoPE | Length extrapolation |
| Attention | Multi-head (7B) / GQA (70B) | GQA (all sizes) | Memory efficiency |
| Attention Bias | No bias | QKV bias | Improved attention quality |
| Vocabulary Size | 32,000 | 151,936 | Multilingual optimization |
| Embedding Tying | Tied | Untied | Better output distribution |

The most notable differences are the addition of bias terms in attention projections, the much larger vocabulary optimized for Chinese, and the use of Grouped-Query Attention across all model sizes.

Core Architectural Components

This section builds your understanding of Qwen's key innovations step by step. We'll start with the attention mechanism, where Qwen makes its most impactful efficiency choice, then explore the feed-forward network and positional encoding. For each component, we'll first build intuition about the problem being solved, then examine the mathematical formulation, and finally see it in code.

Grouped-Query Attention

To understand why Qwen uses Grouped-Query Attention (GQA), we first need to understand the memory bottleneck in transformer inference.

The problem: KV cache explosion

During autoregressive generation, transformers must store the key (K) and value (V) vectors for all previously generated tokens. This "KV cache" allows the model to attend to its full history without recomputing everything at each step. For a model with $L$ layers, $h$ attention heads, head dimension $d_h$, and sequence length $n$, the cache requires:

$$\text{KV Cache Memory} = 2 \times L \times h \times n \times d_h \times \text{bytes\_per\_element}$$

For a 72B-scale model using standard multi-head attention, a batch of just a few 4096-token sequences in FP16 pushes this cache past 40GB. The insight behind GQA is that not every query head needs its own unique key-value representation, so we can share KV heads across groups of query heads to dramatically reduce this memory burden.
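To make the formula concrete, here is a rough back-of-the-envelope calculation using approximate Qwen-72B shapes (80 layers, 64 query heads, head dimension 128). The batch size and byte width are illustrative assumptions, and real serving memory also includes weights and activation overhead:

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per=2):
    """KV cache in GB: 2 (K and V) x layers x kv_heads x seq_len x head_dim x batch."""
    return 2 * layers * kv_heads * seq_len * head_dim * batch * bytes_per / 1e9

# Full multi-head attention (64 KV heads) vs. GQA with 8 KV heads, batch of 4
print(kv_cache_gb(80, 64, 128, 4096, batch=4))  # ~42.9 GB
print(kv_cache_gb(80, 8, 128, 4096, batch=4))   # ~5.4 GB with 8:1 GQA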

Out[3]:
Visualization
Line plot showing KV cache memory in GB versus sequence length for MHA (1:1), GQA 4:1, and GQA 8:1 configurations.
KV cache memory grows linearly with sequence length. GQA with higher ratios (8:1 in Qwen-72B) dramatically reduces memory requirements, enabling longer context windows within the same memory budget.

The solution: sharing key-value heads

While LLaMA-7B uses standard multi-head attention (MHA), Qwen employs Grouped-Query Attention (GQA) across all model sizes. GQA strikes a balance between the quality of MHA and the efficiency of Multi-Query Attention (MQA).

Out[4]:
Visualization
Three diagrams comparing MHA, MQA, and GQA attention patterns showing how keys and values are shared across query heads.
Comparison of attention mechanisms: Multi-Head Attention uses separate K,V heads per query head, Multi-Query shares one K,V across all queries, and Grouped-Query Attention groups query heads to share K,V heads. Qwen uses GQA for all model sizes.

In standard multi-head attention, each query head has its own dedicated key and value heads. GQA groups multiple query heads together to share a single set of key-value heads. This reduces memory requirements while preserving most of the representational capacity.

Understanding the attention variants

Before diving into the formulas, let's build intuition about the three attention variants:

  1. Multi-Head Attention (MHA): The original design. Each query head gets its own private key-value pair. Maximum expressiveness, but maximum memory cost. Think of it as giving each employee their own filing cabinet.

  2. Multi-Query Attention (MQA): The extreme efficiency approach. All query heads share a single key-value pair. Minimal memory, but information gets compressed. Like having everyone share one filing cabinet.

  3. Grouped-Query Attention (GQA): The balanced solution. Small groups of query heads share key-value pairs. Qwen uses this because it retains most of MHA's quality while achieving most of MQA's efficiency gains.

The mathematical formulation

Now we can understand the formal definition. The overall GQA computation concatenates the outputs from all attention heads and projects them back to the model dimension:

$$\text{GQA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where:

  • $Q, K, V$: the query, key, and value tensors derived from the input hidden states
  • $\text{head}_i$: the output of the $i$-th attention head, each capturing one learned pattern of attention
  • $h$: the total number of query heads (e.g., 32 for Qwen-7B, 64 for Qwen-72B)
  • $W^O$: the output projection matrix of shape $(h \cdot d_{\text{head}}, d_{\text{model}})$ that recombines head outputs
  • $\text{Concat}$: concatenation along the head dimension, stacking all head outputs together

Each individual head computes attention using its query projection but shares key-value projections with other heads in its group. This is where the efficiency gain comes from:

$$\text{head}_i = \text{Attention}(Q_i W^Q_i, K_{g(i)} W^K_{g(i)}, V_{g(i)} W^V_{g(i)})$$

where:

  • $Q_i$: the query input for head $i$ (unique to each head)
  • $W^Q_i$: the query projection matrix for head $i$ (unique to each head)
  • $g(i)$: a grouping function that maps query head $i$ to its corresponding KV group index
  • $K_{g(i)}, V_{g(i)}$: the key and value inputs for the group containing head $i$ (shared within the group)
  • $W^K_{g(i)}, W^V_{g(i)}$: the shared key and value projection matrices for that group

How grouping works in practice

The grouping function $g(i)$ determines which query heads share KV heads. For a model with $H_Q$ query heads and $H_{KV}$ key-value heads:

$$g(i) = \left\lfloor \frac{i \cdot H_{KV}}{H_Q} \right\rfloor$$

Let's trace through two concrete examples:

  • Qwen-7B: 32 query heads, 32 KV heads (ratio 1:1). Here $g(i) = i$, so each query head has its own KV head. This is equivalent to standard MHA, the safe choice for a smaller model.

  • Qwen-72B: 64 query heads, 8 KV heads (ratio 8:1). Every 8 consecutive query heads share the same KV head. Query heads 0-7 share KV head 0, heads 8-15 share KV head 1, and so on. This provides an 8× reduction in KV cache memory.
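The mapping is easy to verify in a few lines. The snippet below simply applies the floor formula to the Qwen-72B head counts:

def kv_group(i: int, num_query_heads: int = 64, num_kv_heads: int = 8) -> int:
    """Map query head i to its shared KV head index: floor(i * H_KV / H_Q)."""
    return (i * num_kv_heads) // num_query_heads

print([kv_group(i) for i in range(16)])
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1] -> heads 0-7 share KV head 0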

Why this works

The intuition for why grouping works is that different query heads often look for similar patterns in the keys and values. By forcing them to share, we impose a useful regularization. Empirically, researchers found that models with GQA ratios up to 8:1 show minimal quality degradation while achieving substantial memory and speed improvements.

Summary of advantages

  • Reduced KV cache memory: Fewer KV heads means smaller cache during generation (8× reduction for Qwen-72B)
  • Faster inference: Less memory bandwidth needed to load KV cache, directly improving tokens/second
  • Maintained quality: Grouping preserves most of MHA's representational capacity through shared but still diverse key-value representations

Attention with Bias

Having understood how Qwen organizes its attention heads, we now turn to a subtler design choice: whether to include bias terms in the attention projections.

The design question

When projecting hidden states to queries, keys, and values, we have a choice:

$$\text{Without bias:} \quad Q = XW^Q \qquad \text{With bias:} \quad Q = XW^Q + b^Q$$

LLaMA follows the trend of removing all bias terms, arguing that the weight matrices alone have sufficient capacity and that removing bias simplifies the model. Qwen takes the opposite stance, adding bias to the Query, Key, and Value projections.

Why bias might help

The intuition is subtle but important. Bias terms provide a learned "default" activation that doesn't depend on the input. In attention, this means:

  • Queries can have a baseline direction they search in, independent of the current token
  • Keys can have a baseline "signature" that makes certain patterns more or less likely to match
  • Values can contribute baseline information even before considering specific content

Think of it like this: without bias, attention patterns must emerge entirely from token-token interactions. With bias, the model can learn that "queries from this head should generally attend more to earlier positions" or "values should include this baseline information regardless of content."
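A minimal sketch of this intuition, using a hypothetical toy projection rather than Qwen's actual weights: with a zero input, the bias-free projection has nothing to work with, while the biased projection still produces a learned default direction.

import torch
import torch.nn as nn

torch.manual_seed(0)
proj_no_bias = nn.Linear(4096, 4096, bias=False)
proj_with_bias = nn.Linear(4096, 4096, bias=True)

x = torch.zeros(1, 1, 4096)  # an input carrying no information
print(proj_no_bias(x).abs().max().item())    # 0.0: the query vanishes with the input
print(proj_with_bias(x).abs().max().item())  # > 0: the bias supplies a baseline query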

The parameter cost

This flexibility comes at minimal cost. For Qwen-7B:

$$\text{Bias parameters per layer} = 3 \times d_{\text{model}} = 3 \times 4096 = 12{,}288$$

Compared to the total attention parameters per layer (~67M), this is about 0.02%. Qwen's researchers found this negligible overhead provides measurable quality improvements, particularly in tasks requiring nuanced attention patterns.

Implementation

Unlike LLaMA which removes all bias terms from linear projections, Qwen adds bias to the Query, Key, and Value projections:

In[5]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class QwenAttention(nn.Module):
    """Qwen attention with QKV bias and optional GQA"""

    def __init__(
        self,
        hidden_size: int,
        num_attention_heads: int,
        num_kv_heads: int = None,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_attention_heads
        self.num_kv_heads = num_kv_heads or num_attention_heads
        self.head_dim = hidden_size // num_attention_heads
        self.num_kv_groups = self.num_heads // self.num_kv_heads

        # Qwen uses bias=True for Q, K, V projections
        self.q_proj = nn.Linear(
            hidden_size, self.num_heads * self.head_dim, bias=True
        )
        self.k_proj = nn.Linear(
            hidden_size, self.num_kv_heads * self.head_dim, bias=True
        )
        self.v_proj = nn.Linear(
            hidden_size, self.num_kv_heads * self.head_dim, bias=True
        )
        self.o_proj = nn.Linear(
            self.num_heads * self.head_dim, hidden_size, bias=False
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, _ = hidden_states.shape

        # Project to Q, K, V with bias
        query = self.q_proj(hidden_states)
        key = self.k_proj(hidden_states)
        value = self.v_proj(hidden_states)

        # Reshape for attention
        query = query.view(batch_size, seq_len, self.num_heads, self.head_dim)
        key = key.view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
        value = value.view(
            batch_size, seq_len, self.num_kv_heads, self.head_dim
        )

        # Transpose for batch matrix multiplication
        query = query.transpose(1, 2)  # (batch, num_heads, seq_len, head_dim)
        key = key.transpose(1, 2)
        value = value.transpose(1, 2)

        # Expand KV for GQA if needed
        if self.num_kv_groups > 1:
            key = key.repeat_interleave(self.num_kv_groups, dim=1)
            value = value.repeat_interleave(self.num_kv_groups, dim=1)

        # Scaled dot-product attention
        scale = self.head_dim**-0.5
        attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scale

        # Causal mask
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=hidden_states.device)
            * float("-inf"),
            diagonal=1,
        )
        attn_weights = attn_weights + causal_mask

        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_output = torch.matmul(attn_weights, value)

        # Reshape and project output
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, seq_len, -1)

        return self.o_proj(attn_output)


# Test the attention module
hidden_size = 4096
num_heads = 32
num_kv_heads = 32  # Qwen-7B uses 1:1 ratio

attention = QwenAttention(hidden_size, num_heads, num_kv_heads)
x = torch.randn(2, 64, hidden_size)
output = attention(x)
Out[6]:
Console
Qwen Attention Module:
  Input shape: torch.Size([2, 64, 4096])
  Output shape: torch.Size([2, 64, 4096])

Attention configuration:
  Hidden size: 4096
  Query heads: 32
  KV heads: 32
  Head dimension: 128

Parameter breakdown:
  Total attention parameters: 67,121,152
  Bias parameters (Q, K, V): 12,288
  Bias as % of attention: 0.02%

The output shape matches the input shape, confirming the attention module correctly processes the sequence while preserving dimensions. The 128-dimensional heads (4096 / 32) provide sufficient capacity for complex attention patterns. While LLaMA removes biases for simplicity and to save parameters, Qwen's researchers found that the negligible QKV bias overhead (about 0.02% of attention parameters, well under 0.1% of the full model) yields measurable quality improvements, particularly in tasks requiring nuanced attention patterns.

SwiGLU Feed-Forward Network

After attention aggregates information across positions, the feed-forward network (FFN) processes each position independently, adding nonlinearity and increasing the model's representational capacity. Here we explore why Qwen chose SwiGLU over simpler alternatives.

The evolution of feed-forward design

The original Transformer used a simple two-layer FFN with ReLU activation:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

This works, but researchers discovered that gated variants perform better. The key insight: instead of applying a single nonlinearity, use one transformation to decide what information to pass through (the gate) and another to decide which information is available (the values).

The gating mechanism

SwiGLU introduces element-wise multiplication between two parallel branches:

$$\text{SwiGLU}(x) = \underbrace{\text{Swish}(xW_{\text{gate}})}_{\text{what to keep}} \odot \underbrace{(xW_{\text{up}})}_{\text{candidate values}}$$

Let's unpack each part:

  1. Gate branch ($xW_{\text{gate}}$): Projects the input to an intermediate dimension, then applies Swish activation. The Swish output ranges from near-zero (suppress) to unbounded positive (amplify).

  2. Value branch ($xW_{\text{up}}$): Projects the input to the same intermediate dimension. These are the "candidate" values that might pass through.

  3. Element-wise product ($\odot$): For each element, the gate controls how much of the corresponding value passes through. Near-zero gate means that value is suppressed; large positive gate means that value is amplified.

The Swish activation

Why Swish rather than ReLU for the gate? Swish provides smooth, non-monotonic gating:

$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$

Key properties that make Swish effective:

  • Smooth gradients: Unlike ReLU's hard cutoff at zero, Swish transitions gradually, improving gradient flow during training
  • Non-monotonic: For small negative inputs, Swish actually outputs small negative values before approaching zero. This allows the gate to preserve subtle negative signals
  • Self-gating: The $\sigma(z)$ term acts as a soft switch on the $z$ term itself
Out[7]:
Visualization
Line plot comparing ReLU (hard cutoff at zero) with Swish (smooth curve with small negative dip) activation functions.
Comparison of ReLU and Swish activation functions. Swish provides smooth gradients and exhibits non-monotonic behavior for small negative inputs, allowing subtle signals to pass before suppression. The minimum occurs around z = -1.28.
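A quick numeric check of these properties, reusing the torch and torch.nn.functional imports from the attention cell above:

# Swish(z) = z * sigmoid(z): smooth near zero, with a small negative dip
# (minimum of about -0.28 near z = -1.28) before saturating at zero.
z = torch.tensor([-4.0, -1.28, -0.5, 0.0, 0.5, 2.0])
print(F.silu(z))  # Swish values; note the small negative outputs
print(F.relu(z))  # ReLU for comparison: hard cutoff at zero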

The complete MLP

After the gated transformation, we project back to the model dimension:

$$\text{MLP}(x) = \text{SwiGLU}(x) W_{\text{down}}$$

where $W_{\text{down}}$ has shape $(d_{\text{ff}}, d_{\text{model}})$. The intermediate dimension $d_{\text{ff}}$ is typically $\approx 2.7 \times d_{\text{model}}$ rather than the standard $4\times$. This adjustment accounts for SwiGLU having three projection matrices instead of two, keeping the total parameter count similar to a standard FFN.

Implementation

Like LLaMA, Qwen uses the SwiGLU activation function in its feed-forward layers:

In[8]:
Code
class QwenMLP(nn.Module):
    """Qwen MLP with SwiGLU activation"""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(gate) * up, then down projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


# Qwen-7B dimensions
hidden_size = 4096
intermediate_size = 11008  # ~2.7x hidden_size

mlp = QwenMLP(hidden_size, intermediate_size)
x = torch.randn(2, 64, hidden_size)
output = mlp(x)
Out[9]:
Console
Qwen MLP (SwiGLU):
  Input shape: torch.Size([2, 64, 4096])
  Output shape: torch.Size([2, 64, 4096])

Dimensions:
  Hidden size: 4096
  Intermediate size: 11008
  Expansion ratio: 2.69x

Parameter breakdown:
  Gate projection: 45,088,768
  Up projection: 45,088,768
  Down projection: 45,088,768
  Total MLP parameters: 135,266,304

Analyzing the output

The MLP parameters dominate each transformer layer, accounting for roughly two-thirds of per-layer parameters. The 2.69x expansion ratio (rather than the standard 4x) compensates for SwiGLU's three projections instead of two, keeping total parameter count comparable to standard FFN layers while providing the gating mechanism's quality benefits.

Why gating matters

The key insight is that not all intermediate features are equally useful for every input. In a standard FFN, every intermediate neuron contributes to every output (modulated only by ReLU killing negative values). With SwiGLU, the gate branch learns when each intermediate feature should contribute, creating input-dependent computation paths. This is a form of learned sparsity that improves model quality without increasing inference cost.

RoPE: Rotary Positional Embeddings

Transformers process sequences in parallel, treating each position identically. This is computationally efficient but creates a problem: how does the model know that token 5 comes before token 10? The answer lies in positional encoding, and Qwen uses Rotary Position Embeddings (RoPE), the same approach as LLaMA.

The positional encoding challenge

Early transformers used absolute positional embeddings: learn a vector for position 0, another for position 1, and so on. This has two problems:

  1. Length limitation: The model can only handle positions it saw during training
  2. No relative reasoning: Positions 10-15 look completely different from positions 100-105, even though the relationship (5 tokens apart) is identical

What we really want is a way to encode position such that attention scores depend on relative distance between tokens, not their absolute positions.

The rotation insight

RoPE achieves this by encoding position as a rotation in embedding space. The key insight is that when two rotations are composed, the result depends only on their angular difference:

$$R_m^{-1} \cdot R_n = R_{n-m}$$

If we rotate queries by their position and keys by their position, then when we compute attention (which involves a dot product between queries and keys), the rotation effect depends only on the relative distance.

How rotation encodes position

For each position $m$, we construct a rotation matrix $R_m$ that rotates pairs of dimensions in the embedding space. The rotation angle depends on both the position and the dimension pair:

$$\theta_i = m \cdot \frac{1}{10000^{2i/d}}$$

where $i$ indexes the dimension pair (there are $d/2$ pairs for dimension $d$) and $m$ is the position. Lower-frequency rotations (larger $i$) encode longer-range position information; higher-frequency rotations (smaller $i$) encode fine-grained local position.

Out[10]:
Visualization
Heatmap showing RoPE rotation angles across sequence positions (x-axis) and dimension pairs (y-axis), with color representing the cosine of the rotation angle.
RoPE rotation patterns across positions and dimension pairs. Lower dimensions (top) rotate rapidly for fine-grained local position encoding, while higher dimensions (bottom) rotate slowly for long-range position information. The periodic patterns enable relative position encoding.

The relative position property

The key property emerges when we compute attention scores. Given a query at position $m$ and a key at position $n$:

$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

The attention score depends only on $R_{n-m}$, the rotation corresponding to the relative distance $(n-m)$. This means:

  • Tokens 5 positions apart always have the same relative encoding, regardless of absolute position
  • The model can potentially generalize to longer sequences than it was trained on
  • Nearby tokens have similar rotations; distant tokens have dissimilar rotations

Implementation

Qwen uses the same RoPE implementation as LLaMA for positional encoding, which encodes position through rotation of query and key vectors:

In[11]:
Code
def precompute_freqs_cis(dim: int, max_seq_len: int, theta: float = 10000.0):
    """Precompute rotary embedding frequencies"""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: dim // 2].float() / dim))
    t = torch.arange(max_seq_len, device=freqs.device)
    freqs = torch.outer(t, freqs).float()
    return torch.polar(torch.ones_like(freqs), freqs)


def apply_rotary_emb(
    xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor
):
    """Apply rotary embeddings to query and key tensors"""
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    freqs_cis = freqs_cis[: xq_.shape[1], :].unsqueeze(0).unsqueeze(2)

    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)


# Demonstrate RoPE
head_dim = 128
max_len = 2048
freqs_cis = precompute_freqs_cis(head_dim, max_len)

# Sample Q and K
batch, seq_len, n_heads = 2, 64, 32
q = torch.randn(batch, seq_len, n_heads, head_dim)
k = torch.randn(batch, seq_len, n_heads, head_dim)

q_rope, k_rope = apply_rotary_emb(q, k, freqs_cis)
Out[12]:
Console
RoPE (Rotary Positional Embeddings):
  Head dimension: 128
  Max sequence length: 2048

Before RoPE:
  Q shape: torch.Size([2, 64, 32, 128])
  K shape: torch.Size([2, 64, 32, 128])

After RoPE:
  Q shape: torch.Size([2, 64, 32, 128])
  K shape: torch.Size([2, 64, 32, 128])

RoPE encodes relative position through rotation,
enabling length extrapolation beyond training length.

Understanding the code

The implementation uses complex number arithmetic as an efficient way to apply 2D rotations. The precompute_freqs_cis function generates the rotation angles for each position and dimension pair, stored as complex numbers (cosine + i·sine). The apply_rotary_emb function then applies these rotations by treating pairs of real dimensions as complex numbers and multiplying.

This approach is mathematically equivalent to applying rotation matrices but is much more efficient on modern hardware. The key insight remains: after rotation, attention scores between queries and keys naturally encode relative position, enabling the model to reason about token relationships regardless of their absolute positions in the sequence.
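We can verify the relative-position property numerically with the functions above. The rotate_at helper below is hypothetical (written just for this check) and reuses freqs_cis and head_dim from the previous cell; the query-key dot product comes out the same whenever the positional offset is the same:

def rotate_at(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Apply RoPE to a (1, 1, 1, head_dim) vector as if it sat at position pos."""
    x_ = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_ * freqs_cis[pos]).flatten(3)

torch.manual_seed(0)
q_vec = torch.randn(1, 1, 1, head_dim)
k_vec = torch.randn(1, 1, 1, head_dim)

score_near = (rotate_at(q_vec, 3) * rotate_at(k_vec, 10)).sum()    # offset 7
score_far = (rotate_at(q_vec, 103) * rotate_at(k_vec, 110)).sum()  # offset 7
print(torch.allclose(score_near, score_far, atol=1e-4))  # True: only the offset matters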

Tokenization Strategy

One of Qwen's most distinctive features is its tokenization approach, optimized for multilingual text with particular emphasis on Chinese.

Large Multilingual Vocabulary

Out[13]:
Visualization
Bar chart comparing vocabulary sizes: LLaMA 32K, Mistral 32K, GPT-4 100K, Qwen 152K
Vocabulary size comparison across major LLMs. Qwen's 151,936 token vocabulary is significantly larger, enabling better representation of Chinese characters and multilingual text.

Qwen uses a vocabulary of 151,936 tokens, nearly 5× larger than LLaMA's 32,000. This design decision has important implications:

Advantages of larger vocabulary:

  • Better Chinese representation: Many Chinese characters receive their own tokens rather than being split into byte-level pieces
  • Fewer tokens per text: Chinese text requires significantly fewer tokens, improving both efficiency and context utilization
  • Improved multilingual coverage: Better handling of Korean, Japanese, and other Asian languages

Trade-offs:

  • Larger embedding matrices: Increases model size by roughly 0.5B parameters per embedding matrix for Qwen-7B (estimated after this list)
  • Sparser token distributions: Each token appears less frequently during training
  • Memory overhead: Larger vocabulary requires more memory for softmax computation
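To put the first trade-off in numbers, here is a rough estimate of the extra embedding parameters relative to a LLaMA-sized vocabulary, using Qwen-7B dimensions:

qwen_vocab, llama_vocab, hidden = 151_936, 32_000, 4096
extra_per_matrix = (qwen_vocab - llama_vocab) * hidden
print(f"Extra parameters per embedding matrix: {extra_per_matrix / 1e9:.2f}B")
print(f"With untied input embedding and LM head: {2 * extra_per_matrix / 1e9:.2f}B")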
In[14]:
Code
# Simulated tokenization comparison
# (Actual tokenization would require the Qwen tokenizer)


def estimate_tokens_chinese(text: str, vocab_type: str) -> int:
    """Estimate token count for Chinese text"""
    char_count = len(text)

    if vocab_type == "llama":
        # LLaMA often splits Chinese chars into 2-3 tokens
        return int(char_count * 2.5)
    elif vocab_type == "qwen":
        # Qwen has dedicated Chinese character tokens
        return int(char_count * 1.2)
    return char_count


def estimate_tokens_english(text: str, vocab_type: str) -> int:
    """Estimate token count for English text"""
    word_count = len(text.split())

    # Both handle English similarly
    if vocab_type in ["llama", "qwen"]:
        return int(word_count * 1.3)
    return word_count


# Example texts
chinese_text = "人工智能正在改变我们的生活方式和工作模式"
english_text = "Artificial intelligence is transforming how we live and work"

# Estimate token counts
results = {
    "Chinese": {
        "LLaMA": estimate_tokens_chinese(chinese_text, "llama"),
        "Qwen": estimate_tokens_chinese(chinese_text, "qwen"),
    },
    "English": {
        "LLaMA": estimate_tokens_english(english_text, "llama"),
        "Qwen": estimate_tokens_english(english_text, "qwen"),
    },
}
Out[15]:
Console
Tokenization Efficiency Comparison (Estimated)
=======================================================

Chinese text: 人工智能正在改变我们的生活方式和工作模式
  Characters: 20
  LLaMA tokens: ~50
  Qwen tokens: ~24
  Efficiency gain: 2.1x fewer tokens

English text: Artificial intelligence is transforming how we live and work
  Words: 9
  LLaMA tokens: ~11
  Qwen tokens: ~11

Key insight: Qwen's vocabulary is optimized for Chinese,
using ~2x fewer tokens for Chinese text while maintaining
similar efficiency for English.
Out[16]:
Visualization
Grouped bar chart comparing estimated token counts for Chinese and English text between LLaMA and Qwen tokenizers.
Estimated tokenization efficiency comparison between LLaMA and Qwen. Qwen's large vocabulary provides ~2x efficiency gain for Chinese text while maintaining comparable performance for English, directly translating to better context utilization.

Byte-Level BPE with Special Handling

Qwen uses Byte-Pair Encoding (BPE) at the byte level, similar to GPT-4, but with additional handling for CJK (Chinese, Japanese, Korean) characters:

Out[17]:
Visualization
Flowchart showing Qwen's tokenization pipeline with CJK detection, character-level tokenization for CJK, and BPE for other text.
Qwen tokenization process: text is first processed for CJK characters, then standard BPE is applied. This hybrid approach ensures efficient handling of both Asian and Latin scripts.

The tokenization strategy ensures:

  • No unknown tokens: Byte-level fallback handles any input
  • Efficient CJK: Common characters get dedicated tokens (see the byte-level illustration below)
  • Balanced encoding: Similar information density across languages
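A minimal illustration of why dedicated CJK tokens matter under a byte-level scheme. This uses plain UTF-8 encoding in Python, not the actual Qwen tokenizer:

# Each common CJK character occupies 3 bytes in UTF-8. A byte-level BPE with no
# learned merges for these characters could spend up to 3 tokens per character,
# while a vocabulary with dedicated CJK tokens spends just 1.
text = "人工智能"
for ch in text:
    encoded = ch.encode("utf-8")
    print(ch, "->", encoded.hex(), f"({len(encoded)} bytes)")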

Training Methodology

Training Data Composition

Qwen is trained on approximately 3 trillion tokens, significantly more than LLaMA's 1-1.4 trillion:

Out[18]:
Visualization
Pie chart showing Qwen training data distribution across different source types.
Qwen training data composition emphasizing multilingual web text with strong Chinese representation, code, and high-quality curated sources.

Key data characteristics:

  • Multilingual balance: Substantial Chinese content alongside English and other languages
  • Code inclusion: Programming languages for coding capabilities
  • Quality filtering: Extensive deduplication and quality scoring
  • Domain diversity: Academic papers, books, web text, and specialized corpora

Training Configuration

In[19]:
Code
# Qwen training hyperparameters
training_config = {
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_eps": 1e-8,
    "weight_decay": 0.1,
    "gradient_clip": 1.0,
    # Learning rate
    "peak_lr": 3e-4,
    "warmup_steps": 2000,
    "lr_schedule": "cosine",
    "min_lr_ratio": 0.1,
    # Precision
    "precision": "bfloat16",
    # Context length
    "context_length": 8192,  # Qwen supports longer context
    # Training tokens
    "total_tokens": "3T",
}
Out[20]:
Console
Qwen Training Configuration
=======================================================

Optimizer: AdamW
  β₁ = 0.9, β₂ = 0.95
  Weight decay = 0.1
  Gradient clipping = 1.0

Learning Rate Schedule:
  Peak LR = 0.0003
  Warmup = 2,000 steps
  Schedule = cosine

Training Scale:
  Context length = 8,192 tokens
  Total training tokens = 3T
  Precision = bfloat16

Qwen's training follows similar principles to LLaMA but with extended context length (8192 vs 2048) and more training tokens. The longer context is enabled through careful RoPE scaling and NTK-aware interpolation.
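To make the warmup-plus-cosine schedule concrete, here is a minimal sketch. The total step count is a placeholder, not Qwen's published value:

import math


def lr_at_step(step, peak_lr=3e-4, warmup=2000, total_steps=500_000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


print(lr_at_step(1_000))    # mid-warmup: 1.5e-4
print(lr_at_step(2_000))    # peak: 3e-4
print(lr_at_step(500_000))  # end of schedule: 3e-5 (10% of peak)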

Context Length Extension

Qwen supports 8192 token context through training, with the ability to extend further using position interpolation techniques:

Out[21]:
Visualization
Bar chart comparing context lengths: LLaMA 2K/4K, GPT-3 4K, Qwen 8K native with 32K extended.
Context length comparison across model families. Qwen's native 8K context and ability to extend to 32K+ positions through interpolation provides significant advantages for long-document processing.
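Below is a minimal sketch of linear position interpolation, one simple way to extend RoPE beyond the trained length by squeezing out-of-range positions back into it. This is illustrative only; Qwen's extended-context variants use NTK-aware scaling of the RoPE base rather than exactly this formula:

def precompute_freqs_cis_interpolated(
    dim: int, target_len: int, trained_len: int = 8192, theta: float = 10000.0
):
    """RoPE frequencies with linear position interpolation beyond trained_len."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: dim // 2].float() / dim))
    t = torch.arange(target_len).float()
    if target_len > trained_len:
        t = t * (trained_len / target_len)  # map 0..target_len into 0..trained_len
    angles = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(angles), angles)


freqs_32k = precompute_freqs_cis_interpolated(128, 32768)
print(freqs_32k.shape)  # torch.Size([32768, 64])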

Complete Model Implementation

Having explored each component individually, we now assemble them into a complete Qwen model. This integration reveals how the pieces fit together and provides a reference implementation you can extend for your own experiments.

The journey from components to complete model follows this path:

  1. Configuration: Define the hyperparameters that determine model size and behavior
  2. Normalization: Implement RMSNorm, which stabilizes training by normalizing hidden states
  3. Transformer block: Combine attention, FFN, and normalization into the repeating unit
  4. Full model: Stack blocks with embeddings and output projection

Model Configuration

We start by defining a configuration class that captures all the hyperparameters. This makes it easy to instantiate different model sizes and ensures consistency across components:

In[22]:
Code
from dataclasses import dataclass


@dataclass
class QwenConfig:
    """Configuration for Qwen model"""

    hidden_size: int = 4096
    intermediate_size: int = 11008
    num_hidden_layers: int = 32
    num_attention_heads: int = 32
    num_kv_heads: int = 32  # For GQA; same as num_attention_heads = MHA
    vocab_size: int = 151936
    max_position_embeddings: int = 8192
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10000.0
    use_qkv_bias: bool = True  # Qwen uses bias in attention


# Configurations for different Qwen sizes
QWEN_CONFIGS = {
    "1.8B": QwenConfig(
        hidden_size=2048,
        intermediate_size=5504,
        num_hidden_layers=24,
        num_attention_heads=16,
        num_kv_heads=16,
    ),
    "7B": QwenConfig(
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_kv_heads=32,
    ),
    "14B": QwenConfig(
        hidden_size=5120,
        intermediate_size=13696,
        num_hidden_layers=40,
        num_attention_heads=40,
        num_kv_heads=40,
    ),
    "72B": QwenConfig(
        hidden_size=8192,
        intermediate_size=24576,
        num_hidden_layers=80,
        num_attention_heads=64,
        num_kv_heads=8,  # GQA with 8:1 ratio
    ),
}


def get_qwen_config(size: str) -> QwenConfig:
    return QWEN_CONFIGS[size]
Out[23]:
Console
Qwen Model Configurations
======================================================================

Qwen-1.8B:
  Layers: 24
  Hidden size: 2,048
  Attention heads: 16
  KV heads: 16 (GQA ratio 1:1)
  Head dimension: 128
  FFN dimension: 5,504

Qwen-7B:
  Layers: 32
  Hidden size: 4,096
  Attention heads: 32
  KV heads: 32 (GQA ratio 1:1)
  Head dimension: 128
  FFN dimension: 11,008

Qwen-14B:
  Layers: 40
  Hidden size: 5,120
  Attention heads: 40
  KV heads: 40 (GQA ratio 1:1)
  Head dimension: 128
  FFN dimension: 13,696

Qwen-72B:
  Layers: 80
  Hidden size: 8,192
  Attention heads: 64
  KV heads: 8 (GQA ratio 8:1)
  Head dimension: 128
  FFN dimension: 24,576

The configuration reveals several key design decisions. Note how GQA ratio varies: Qwen-7B uses 1:1 (equivalent to MHA), while Qwen-72B uses 8:1 for significant memory savings. The intermediate size follows the ~2.7x rule we discussed for SwiGLU. The large vocabulary (151,936) is constant across all sizes, optimized for Chinese characters.

RMSNorm Implementation

Before assembling the transformer block, we need our normalization layer. RMSNorm (Root Mean Square Layer Normalization) is simpler than the original LayerNorm: it normalizes by the root mean square of activations without centering (subtracting the mean).

In[24]:
Code
class QwenRMSNorm(nn.Module):
    """RMSNorm as used in Qwen"""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x


# Test RMSNorm
norm = QwenRMSNorm(4096)
x = torch.randn(2, 64, 4096)
normalized = norm(x)
Out[25]:
Console
RMSNorm:
  Input shape: torch.Size([2, 64, 4096])
  Output shape: torch.Size([2, 64, 4096])

Normalization effect:
  Input  - mean: 0.0014, std: 0.9991
  Output - mean: 0.0014, std: 1.0000

Parameters: 4,096 (learnable scale only)

RMSNorm normalizes input magnitudes while preserving the mean, unlike LayerNorm which centers around zero. The output standard deviation is close to 1.0, indicating proper normalization. With only 4,096 learnable parameters (the scale weights), RMSNorm adds minimal overhead while providing training stability.

The mathematical formulation is:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$

where $d$ is the hidden dimension, $\epsilon$ is a small constant for numerical stability, and $\gamma$ is the learnable scale parameter. By skipping the mean centering of LayerNorm, RMSNorm is slightly faster while empirically performing just as well.

Complete Transformer Block

Now we combine attention, FFN, and normalization into a single transformer block. This is the fundamental repeating unit of the model. Qwen-7B stacks 32 of these blocks.

The block follows the pre-norm architecture pattern, where normalization happens before each sub-layer rather than after. This improves training stability, especially for deep networks:

In[26]:
Code
class QwenDecoderLayer(nn.Module):
    """Single Qwen transformer decoder layer"""

    def __init__(self, config: QwenConfig):
        super().__init__()
        self.hidden_size = config.hidden_size

        # Attention with pre-norm
        self.input_layernorm = QwenRMSNorm(
            config.hidden_size, config.rms_norm_eps
        )
        self.self_attn = QwenAttention(
            config.hidden_size,
            config.num_attention_heads,
            config.num_kv_heads,
        )

        # FFN with pre-norm
        self.post_attention_layernorm = QwenRMSNorm(
            config.hidden_size, config.rms_norm_eps
        )
        self.mlp = QwenMLP(config.hidden_size, config.intermediate_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Self attention with residual
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states

        # FFN with residual
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        return hidden_states


# Test decoder layer
config = get_qwen_config("7B")
layer = QwenDecoderLayer(config)
x = torch.randn(2, 64, config.hidden_size)
output = layer(x)
Out[27]:
Console
Qwen Decoder Layer:
  Input shape: torch.Size([2, 64, 4096])
  Output shape: torch.Size([2, 64, 4096])

Parameter breakdown:
  Attention: 67,121,152 (33.2%)
  MLP: 135,266,304 (66.8%)
  Norms: 8,192 (0.0%)
  Total: 202,395,648

The parameter distribution shows MLP dominates at roughly 65% of layer parameters, with attention at about 35%. This is typical for modern transformers using SwiGLU. The normalization layers contribute negligibly to parameter count but are critical for training stability.

The code reveals the pre-norm pattern in action: each sub-layer computes residual = hidden_states before normalization, then adds the sub-layer output back (hidden_states = residual + hidden_states). This residual connection is essential. It provides a direct gradient pathway during backpropagation and allows each layer to learn a refinement of its input rather than a complete transformation.

Complete Qwen Model

Finally, we assemble the complete model by wrapping our transformer blocks with:

  1. Token embeddings: Convert token IDs to dense vectors
  2. Stacked transformer blocks: The core computation
  3. Final normalization: Stabilize the output before prediction
  4. Language model head: Project back to vocabulary size for next-token prediction
In[28]:
Code
class QwenModel(nn.Module):
    """Complete Qwen model"""

    def __init__(self, config: QwenConfig):
        super().__init__()
        self.config = config

        # Token embeddings (untied from output)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)

        # Transformer layers
        self.layers = nn.ModuleList(
            [QwenDecoderLayer(config) for _ in range(config.num_hidden_layers)]
        )

        # Final normalization
        self.norm = QwenRMSNorm(config.hidden_size, config.rms_norm_eps)

        # Output projection (untied from embeddings)
        self.lm_head = nn.Linear(
            config.hidden_size, config.vocab_size, bias=False
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.embed_tokens(input_ids)

        for layer in self.layers:
            hidden_states = layer(hidden_states)

        hidden_states = self.norm(hidden_states)
        logits = self.lm_head(hidden_states)

        return logits


# Create a small model for testing
test_config = QwenConfig(
    hidden_size=512,
    intermediate_size=1376,
    num_hidden_layers=4,
    num_attention_heads=8,
    num_kv_heads=8,
    vocab_size=32000,  # Smaller for testing
    max_position_embeddings=512,
)

model = QwenModel(test_config)
total_params = sum(p.numel() for p in model.parameters())

# Test forward pass
test_input = torch.randint(0, test_config.vocab_size, (2, 64))
with torch.no_grad():
    logits = model(test_input)
Out[29]:
Console
Qwen Model (Test Configuration):
  Total parameters: 45,427,200

Component breakdown:
  Embeddings: 16,384,000 (36.1%)
  Layers: 12,658,688 (27.9%)
  Final norm: 512
  LM head: 16,384,000 (36.1%)

Forward pass:
  Input shape: torch.Size([2, 64])
  Output shape: torch.Size([2, 64, 32000])
  Output matches expected: True

The model successfully processes input tokens through all layers and produces vocabulary logits. With untied embeddings, both the input embedding and output LM head have separate parameters, each accounting for a significant fraction of total model size. This is particularly pronounced in Qwen due to its large vocabulary.

A note on tied vs. untied embeddings

Some models (like the original GPT-2) tie the input embedding and output projection, using the same weight matrix for both. This saves parameters but forces the embedding space to serve double duty: representing input tokens and producing output logits. Qwen untied these, allowing the input embedding to specialize for contextual representation and the output head to specialize for prediction. The cost is doubling the embedding parameters (about 0.6B for Qwen-7B), but the quality gain justifies this for large models.
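A quick estimate of that cost, using Qwen-7B-like dimensions:

vocab, hidden = 151_936, 4096
tied = vocab * hidden        # one matrix shared by the embedding and LM head
untied = 2 * vocab * hidden  # separate input embedding and output projection
print(f"tied: {tied / 1e9:.2f}B, untied: {untied / 1e9:.2f}B "
      f"(+{(untied - tied) / 1e9:.2f}B)")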

Parameter Count Analysis

Understanding where parameters live helps with capacity planning and debugging. Let's systematically count parameters for each Qwen variant:

In[30]:
Code
def count_parameters(config: QwenConfig) -> dict:
    """Calculate parameter count for each component"""
    hidden = config.hidden_size
    intermediate = config.intermediate_size
    vocab = config.vocab_size
    n_layers = config.num_hidden_layers
    n_heads = config.num_attention_heads
    n_kv_heads = config.num_kv_heads
    head_dim = hidden // n_heads

    # Embeddings (untied, so counted separately)
    embed_params = vocab * hidden

    # Per layer: attention + MLP + norms
    # Attention: Q, K, V projections + output projection
    q_params = hidden * hidden + (hidden if config.use_qkv_bias else 0)
    kv_params = 2 * (
        hidden * (n_kv_heads * head_dim)
        + (n_kv_heads * head_dim if config.use_qkv_bias else 0)
    )
    o_params = hidden * hidden
    attn_params = q_params + kv_params + o_params

    # MLP: gate, up, down projections
    mlp_params = 3 * hidden * intermediate

    # Norms: 2 per layer
    norm_params = 2 * hidden

    layer_params = attn_params + mlp_params + norm_params
    all_layer_params = layer_params * n_layers

    # Final norm
    final_norm_params = hidden

    # LM head (untied)
    lm_head_params = hidden * vocab

    total = embed_params + all_layer_params + final_norm_params + lm_head_params

    return {
        "embeddings": embed_params,
        "per_layer": layer_params,
        "all_layers": all_layer_params,
        "final_norm": final_norm_params,
        "lm_head": lm_head_params,
        "total": total,
    }


# Calculate for each size
for size in ["1.8B", "7B", "14B", "72B"]:
    config = get_qwen_config(size)
    params = count_parameters(config)
Out[31]:
Console
Qwen Parameter Breakdown
======================================================================

Qwen-1.8B:
  Embeddings: 0.31B (16.9%)
  Per layer: 50.6M
  All layers: 1.21B (66.1%)
  LM head: 0.31B (16.9%)
  Total: 1.84B

Qwen-7B:
  Embeddings: 0.62B (8.1%)
  Per layer: 202.4M
  All layers: 6.48B (83.9%)
  LM head: 0.62B (8.1%)
  Total: 7.72B

Qwen-14B:
  Embeddings: 0.78B (5.5%)
  Per layer: 315.3M
  All layers: 12.61B (89.0%)
  LM head: 0.78B (5.5%)
  Total: 14.17B

Qwen-72B:
  Embeddings: 1.24B (2.0%)
  Per layer: 755.0M
  All layers: 60.40B (96.0%)
  LM head: 1.24B (2.0%)
  Total: 62.89B

Notice how the large vocabulary (151,936 tokens) significantly impacts the embedding and LM head parameters. For Qwen-7B, embeddings alone account for approximately 0.6B parameters (about 8% of total), compared to less than 3% in LLaMA-7B.

Out[32]:
Visualization
Stacked bar chart showing parameter distribution (embeddings, transformer layers, LM head) for Qwen 1.8B, 7B, 14B, and 72B models.
Parameter distribution across Qwen model sizes. The large vocabulary creates substantial embedding overhead, especially visible in smaller models. Transformer layers dominate in larger models, while the embedding + LM head fraction decreases as models scale.

Model Variants and Family

Qwen Model Family

Out[33]:
Visualization
Hierarchical diagram showing Qwen model variants branching from base models to specialized versions.
Qwen model family tree showing base models, chat variants, and specialized versions. Each base model has corresponding instruction-tuned (Chat) and quantized variants.

The Qwen family has expanded significantly since its initial release:

Base models (various sizes):

  • Qwen-1.8B, Qwen-7B, Qwen-14B, Qwen-72B
  • Pre-trained on diverse multilingual data
  • Strong foundation for fine-tuning

Chat variants:

  • Instruction-tuned using SFT and RLHF
  • Optimized for conversational use cases
  • Safety-aligned for deployment

Specialized variants:

  • Qwen-VL: Vision-language model for image understanding
  • Qwen-Audio: Audio understanding and generation
  • CodeQwen: Specialized for code generation and understanding
  • Qwen-Math: Enhanced mathematical reasoning

Qwen 2 Improvements

Qwen 2, released in 2024, introduced several architectural refinements:

In[34]:
Code
# Qwen 2 architectural improvements
qwen2_improvements = {
    "architecture": {
        "gqa_all_sizes": "GQA used across all model sizes",
        "yarn_rope": "YaRN RoPE for better length extrapolation",
        "sliding_window": "Optional sliding window attention",
    },
    "training": {
        "more_data": "Training on 7T+ tokens",
        "better_filtering": "Improved data quality filtering",
        "longer_context": "Native 32K context in some variants",
    },
    "efficiency": {
        "flash_attention": "Native FlashAttention-2 support",
        "quantization": "Optimized for INT4/INT8 inference",
    },
}
Out[35]:
Console
Qwen 2 Improvements over Qwen 1
=======================================================

Architecture:
  • GQA used across all model sizes
  • YaRN RoPE for better length extrapolation
  • Optional sliding window attention

Training:
  • Training on 7T+ tokens
  • Improved data quality filtering
  • Native 32K context in some variants

Efficiency:
  • Native FlashAttention-2 support
  • Optimized for INT4/INT8 inference

Limitations and Impact

Current Limitations

Despite its strengths, Qwen has several notable limitations that affect its practical use:

Technical constraints:

  • Large vocabulary increases memory requirements
  • Untied embeddings add parameters without proportional quality gains in all tasks
  • Slower tokenization due to larger vocabulary lookup

Capability gaps:

  • Hallucination remains an issue, particularly for factual queries
  • Long-context performance degrades beyond training length despite RoPE
  • Mathematical reasoning, while improved, still trails specialized models

Deployment considerations:

  • 72B model requires significant infrastructure
  • Quantization needed for consumer hardware deployment
  • Inference speed impacted by large vocabulary softmax

These limitations are being actively addressed in subsequent Qwen versions; Qwen 2 in particular improves efficiency and narrows several of these capability gaps.

Impact on the Field

Qwen's contributions extend beyond its technical specifications:

Demonstrating multilingual competence:

  • Proved that non-English-first models can achieve competitive performance
  • Showed the importance of vocabulary design for multilingual capability
  • Influenced tokenization strategies in subsequent models

Open-source ecosystem:

  • Released weights under permissive licenses (Apache 2.0 for many variants)
  • Provided strong baseline for Chinese NLP research
  • Enabled fine-tuning for specialized Chinese/multilingual applications

Architectural validation:

  • Confirmed benefits of GQA across model sizes
  • Validated attention bias addition for quality improvement
  • Demonstrated viability of larger vocabularies with proper training

The success of Qwen models has encouraged other organizations to invest in multilingual model development and has highlighted the importance of considering non-English languages in LLM design from the ground up.

Key Architecture Parameters

When working with or adapting Qwen models, these parameters have the most significant impact; a short configuration example at the end of this section ties them together.

Model Dimension Parameters:

  • hidden_size: The model's hidden dimension (2048 for 1.8B, up to 8192 for 72B). Determines representation capacity and scales quadratically with attention compute.
  • num_hidden_layers: Number of transformer blocks (24-80 across sizes). More layers enable deeper reasoning but increase inference latency linearly.
  • intermediate_size: FFN hidden dimension, typically ~2.7x hidden_size. Controls MLP capacity, which dominates parameter count.

Attention Parameters:

  • num_attention_heads: Query head count (16-64 across sizes). More heads enable diverse attention patterns but increase memory for attention scores.
  • num_kv_heads: Key-value head count for GQA. Lower ratios (e.g., 8:1 in Qwen-72B) significantly reduce KV cache memory during generation.
  • use_qkv_bias: Whether to include bias in attention projections (True for Qwen). Small parameter overhead for quality gains.

Sequence Parameters:

  • max_position_embeddings: Maximum context length (8192 native, extendable to 32K+). Affects RoPE frequency computation and KV cache size.
  • rope_theta: Base frequency for RoPE (10000.0 default). Higher values enable better length extrapolation.

Normalization:

  • rms_norm_eps: Epsilon for numerical stability in RMSNorm (1e-6). Rarely needs adjustment but critical for mixed-precision training.

Vocabulary:

  • vocab_size: Token vocabulary size (151,936 for Qwen). Larger vocabularies improve multilingual efficiency but increase embedding/LM head parameters significantly.
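As an illustration of how these parameters interact, the snippet below builds a hypothetical long-context variant of the 7B configuration using the QwenConfig class defined earlier. The specific values (4:1 GQA, 32K context, larger rope_theta) are illustrative choices, not an official Qwen configuration:

# Hypothetical long-context 7B-style configuration (illustrative values only)
long_ctx_7b = QwenConfig(
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_kv_heads=8,                 # 4:1 GQA to shrink the KV cache
    max_position_embeddings=32768,  # extended context window
    rope_theta=1_000_000.0,         # larger RoPE base for longer sequences
)
print(long_ctx_7b)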

Summary

Qwen adapts LLaMA's foundation for multilingual, particularly Chinese-focused, applications. By combining proven innovations (RMSNorm, SwiGLU, RoPE) with targeted modifications (large vocabulary, attention bias, GQA across all sizes), Qwen achieves strong performance across both Chinese and English tasks.

Key takeaways:

  1. Vocabulary design matters for multilingual models: Qwen's 151,936 token vocabulary, while adding parameters, enables 2x more efficient tokenization of Chinese text. This directly translates to better context utilization and reduced inference cost for Chinese applications.

  2. Architectural refinements compound: Adding bias to attention projections and using GQA universally are small changes individually, but together they improve both quality and efficiency. The lesson: don't dismiss seemingly minor architectural modifications.

  3. Training data composition shapes capability: Qwen's strong Chinese performance stems from deliberate training data curation, not just architectural choices. Balanced multilingual training from the start produces better results than post-hoc adaptation.

  4. Open weights accelerate progress: By releasing model weights openly, Alibaba enabled rapid ecosystem development. The proliferation of Qwen fine-tunes and applications demonstrates the value of open research.

  5. Model families enable diverse applications: The expansion from base Qwen to Chat, VL, Audio, and Code variants shows how a strong foundation enables specialization. Investing in base model quality pays dividends across the entire family.

Looking forward: Qwen's evolution from version 1 to 2 shows continuous refinement in architecture, training, and efficiency. The model family's success has influenced how the broader community thinks about multilingual model development, vocabulary design, and the importance of serving diverse language communities. As context lengths extend and multimodal capabilities expand, Qwen's foundation positions it well for continued evolution.
