
Transformer-XL: Extending Transformers to Long Sequences

Michael Brenndoerfer · November 2, 2025 · 16 min read · 3,775 words

A comprehensive guide to Transformer-XL, the architectural innovation that enabled transformers to handle longer sequences through segment-level recurrence and relative positional encodings. Learn how this model extended context length while maintaining efficiency and influenced modern language models.

This article is part of the free-to-read History of Language AI book.

2019: Transformer-XL

By 2019, the transformer architecture had revolutionized natural language processing, enabling models like BERT and GPT to achieve remarkable performance across diverse tasks. However, researchers were encountering a fundamental limitation: standard transformers struggled with long sequences. The architecture's attention mechanism had quadratic computational complexity with respect to sequence length, and the fixed positional encodings used in the original transformer design created challenges when processing sequences longer than those seen during training. For tasks that required understanding long-range dependencies, such as document modeling, long-form text generation, or contextual understanding across extended passages, standard transformers were computationally expensive or fundamentally limited.

Researchers at Google Brain and Carnegie Mellon University, led by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov, recognized that the transformer's limitation to fixed-length context windows was a significant barrier to its application in many practical scenarios. The problem wasn't just computational efficiency, though that was important. The deeper issue was that standard transformers processed each segment independently, losing all context from previous segments. When a model encountered a long document, it would split the document into fixed-length segments, process each segment separately, and have no memory of what came before. This architectural constraint prevented the model from maintaining coherence and understanding across long contexts.

Transformer-XL, introduced in 2019, addressed these limitations through two key innovations: segment-level recurrence and relative positional encodings. The model introduced a recurrent mechanism that allowed information from previous segments to persist across segment boundaries, enabling the model to maintain context over sequences much longer than the segment length. Simultaneously, the architecture replaced absolute positional encodings with relative positional encodings, allowing the model to generalize to longer sequences than those seen during training. These innovations enabled Transformer-XL to process sequences up to several times longer than standard transformers while maintaining computational efficiency.

The impact of Transformer-XL extended well beyond the immediate improvement in sequence length handling. The model demonstrated that transformers could effectively model long-range dependencies when equipped with the right architectural modifications. The relative positional encoding scheme, in particular, would influence many subsequent transformer variants. Modern language models that handle long contexts, such as PaLM and LLaMA, build on ideas introduced in Transformer-XL. The architecture's approach to extending context length while maintaining efficiency has become a foundation for modern long-context language models.

The Problem

The transformer architecture, introduced in 2017, had revolutionized sequence modeling by replacing recurrent layers with self-attention mechanisms. This change enabled parallel processing and proved highly effective for many natural language processing tasks. However, the architecture had inherent limitations when dealing with long sequences. These limitations became increasingly problematic as researchers sought to apply transformers to tasks requiring understanding of extended contexts, such as document-level understanding, long-form text generation, or maintaining coherence across lengthy passages.

The most immediate problem was computational complexity. The self-attention mechanism in transformers requires computing attention scores between every pair of positions in the sequence. For a sequence of length $n$, this requires computing an $n \times n$ attention matrix, resulting in quadratic computational complexity $O(n^2)$ with respect to sequence length. Memory requirements also grow quadratically, as the attention matrix must be stored during training. For sequences longer than a few hundred tokens, this computational cost becomes prohibitive. Early transformer models like BERT were limited to sequences of 512 tokens, and even increasing to 1024 tokens required significant computational resources.
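
To make the scaling concrete, here is a quick back-of-the-envelope calculation (illustrative only) showing how fast the number of attention-score entries grows with sequence length:

```python
# Rough illustration: the attention matrix stores one score per pair of
# positions, so its size grows quadratically with sequence length.
for n in (512, 1024, 4096):
    entries = n * n
    print(f"n={n:5d}  attention entries per head, per layer: {entries:,}")

# n=  512  ->    262,144
# n= 1024  ->  1,048,576   (4x more)
# n= 4096  -> 16,777,216   (64x more)
```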

Beyond computational constraints, the architecture had a more fundamental limitation related to how it handled context. Standard transformers process each training example as a fixed-length segment. When dealing with a long document or sequence, the model would split the input into separate segments of fixed length, typically 512 or 1024 tokens. Each segment was processed independently, with no information flowing between segments. This meant that tokens at the beginning of a segment had no access to context from previous segments, and tokens at the end of a segment could not influence the processing of subsequent segments.
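
The sketch below makes this vanilla segmentation concrete. The helper is hypothetical, not taken from any particular codebase, but it mirrors the standard practice of chopping a long token sequence into fixed-length pieces that are then processed independently:

```python
def split_into_segments(token_ids, segment_len=512):
    """Split a long token sequence into fixed-length segments.

    In a standard transformer each segment is processed independently,
    so no information crosses the boundaries created here.
    """
    return [token_ids[i:i + segment_len]
            for i in range(0, len(token_ids), segment_len)]

segments = split_into_segments(list(range(1300)), segment_len=512)
print([len(s) for s in segments])  # [512, 512, 276]
```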

This segmentation approach created several problems. First, the model lost all long-range dependencies that spanned segment boundaries. If an important piece of information appeared early in a document and was referenced much later, the model could not connect these references. Second, the model could not maintain coherence across segments. When generating text or processing a long document, the model's understanding reset at each segment boundary, leading to inconsistencies and loss of context. Third, the fixed segment length created an artificial boundary that didn't align with natural linguistic units like sentences or paragraphs, potentially breaking up coherent units of meaning.

The positional encoding scheme used in standard transformers created additional challenges. The original transformer architecture used fixed, sinusoidal positional encodings added to the token embeddings, while later models such as BERT and GPT used learned absolute position embeddings defined only up to a maximum length. In both cases the encodings provided absolute position information, indicating where each token appeared in the sequence, and positions beyond those encountered during training were effectively out of distribution. If a model was trained on sequences of length 512, it would struggle when presented with sequences of length 1024, because it would encounter positional encodings it had never learned to use.
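
For reference, a minimal NumPy sketch of the original sinusoidal scheme (standard formula, hypothetical helper name) shows how each row encodes an absolute position; a model trained only on the first 512 rows has never learned to interpret the rows beyond them:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal encodings from the original transformer:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)
print(pe.shape)  # (512, 64): one absolute-position row per token position
```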

This limitation meant that extending context length required retraining the model with new positional encodings, which was computationally expensive and time-consuming. Researchers who wanted to work with longer sequences faced a difficult choice: either accept the computational cost of training with longer sequences from scratch, or work within the constraints of the fixed sequence length. Neither option was ideal for applications that needed flexible context length or that dealt with inherently long sequences.

Another challenge was that the attention mechanism itself is position-agnostic: it distinguishes nearby from distant positions only through whatever positional information is added to the inputs. The model could learn to use relative positions implicitly through attention weights, but there was no explicit mechanism for representing relative positional relationships. This made it harder for the model to separate local from long-range dependencies, and to generalize positional relationships to sequences of different lengths.

The problems of limited context length and segment independence became particularly apparent in language modeling tasks. When training a language model on long documents, the model would see each segment independently. A word appearing at the beginning of a document would have no influence on how the model predicted words near the end of the document, even if they were strongly related. This limitation affected not just language modeling but any task that required maintaining context across long sequences, from document classification to long-form question answering.

Researchers recognized that these limitations were holding back the application of transformers to many practical scenarios. Real-world applications often involve documents, conversations, or contexts that extend well beyond a few hundred tokens. Academic papers, legal documents, codebases, and extended dialogues all require understanding relationships across thousands or tens of thousands of tokens. The field needed an architectural innovation that would enable transformers to handle longer sequences while maintaining computational efficiency and avoiding the artificial boundaries created by fixed-length segmentation.

The Solution

Transformer-XL addressed these fundamental limitations through two complementary innovations: segment-level recurrence and relative positional encodings. Together, these mechanisms enabled the model to maintain context across segment boundaries while allowing natural extension to sequences longer than those seen during training. The architecture maintained the computational efficiency of processing fixed-length segments while gaining the benefits of much longer effective context.

The first key innovation was segment-level recurrence. Instead of processing each segment completely independently, Transformer-XL maintains a hidden state from previous segments and uses it when processing the current segment. During training, the model processes segments sequentially. For each segment, it computes the hidden states as in a standard transformer, but it also incorporates the hidden states from the previous segment. This creates a recurrent connection that allows information to flow across segment boundaries, effectively extending the model's memory beyond the segment length.

The mathematical formulation of segment-level recurrence works as follows. Let $\mathbf{s}_\tau = [x_{\tau,1}, \ldots, x_{\tau,L}]$ represent a segment of length $L$ at position $\tau$, and let $\mathbf{h}_\tau^n$ represent the $n$-th layer hidden state sequence for segment $\tau$. In a standard transformer, each layer computes $\mathbf{h}_\tau^n$ solely from $\mathbf{h}_\tau^{n-1}$, the hidden states from the previous layer of the same segment. In Transformer-XL, the model computes $\mathbf{h}_\tau^n$ from both $\mathbf{h}_\tau^{n-1}$ (the previous layer of the current segment) and $\mathbf{h}_{\tau-1}^{n-1}$ (the previous layer of the previous segment, which is cached and excluded from gradient updates). Because each layer reaches one segment further back than the layer below it, the effective context grows linearly with network depth.
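
Concretely, the recurrence from the paper can be written as follows, where $\mathrm{SG}(\cdot)$ denotes stop-gradient and $[\cdot \circ \cdot]$ denotes concatenation along the sequence dimension. Note that queries come only from the current segment, while keys and values are computed over the extended context:

$$
\begin{aligned}
\tilde{\mathbf{h}}_\tau^{n-1} &= \left[\mathrm{SG}(\mathbf{h}_{\tau-1}^{n-1}) \circ \mathbf{h}_\tau^{n-1}\right] \\
\mathbf{q}_\tau^{n},\ \mathbf{k}_\tau^{n},\ \mathbf{v}_\tau^{n} &= \mathbf{h}_\tau^{n-1}\mathbf{W}_q^\top,\ \tilde{\mathbf{h}}_\tau^{n-1}\mathbf{W}_k^\top,\ \tilde{\mathbf{h}}_\tau^{n-1}\mathbf{W}_v^\top \\
\mathbf{h}_\tau^{n} &= \text{Transformer-Layer}\left(\mathbf{q}_\tau^{n}, \mathbf{k}_\tau^{n}, \mathbf{v}_\tau^{n}\right)
\end{aligned}
$$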

The attention mechanism is modified to incorporate previous segment information. For each position in the current segment, the model can attend to all positions in the current segment plus all positions in the cached previous segment. This means that if the current segment has length $L$ and we cache one previous segment, each position can attend to up to $2L$ positions, effectively doubling the context length while maintaining the computational cost of processing segments of length $L$. The cached hidden states from previous segments are computed once and reused across multiple forward passes, making this approach computationally efficient.
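
As an illustration, here is a minimal single-head PyTorch sketch of attention over the current segment plus a cached memory. The function name, weight shapes, and the omission of the relative positional terms are simplifications for clarity, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def attend_with_memory(h_curr, h_mem, W_q, W_k, W_v):
    """Single-head attention over the current segment plus cached memory.

    h_curr: (L, d) hidden states of the current segment (previous layer)
    h_mem:  (M, d) cached hidden states from the previous segment(s)
    """
    # Stop gradients from flowing into the cached segment, as in the paper.
    h_ext = torch.cat([h_mem.detach(), h_curr], dim=0)       # (M + L, d)

    q = h_curr @ W_q                                          # queries: current segment only
    k = h_ext @ W_k                                           # keys over extended context
    v = h_ext @ W_v                                           # values over extended context

    scores = q @ k.T / (q.shape[-1] ** 0.5)                   # (L, M + L)

    # Causal mask: position i may attend to all M memory slots and to
    # current-segment positions j <= i.
    L, M = h_curr.shape[0], h_mem.shape[0]
    mask = torch.ones(L, M + L, dtype=torch.bool)
    mask[:, M:] = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))

    return F.softmax(scores, dim=-1) @ v                      # (L, d)

# Example usage with random tensors:
d, L_cur, M_mem = 64, 4, 4
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = attend_with_memory(torch.randn(L_cur, d), torch.randn(M_mem, d), Wq, Wk, Wv)
print(out.shape)  # torch.Size([4, 64]); each query saw up to 2L positions
```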

The second key innovation was the introduction of relative positional encodings. Instead of encoding absolute positions in the sequence, Transformer-XL encodes the relative distance between positions. This change has several important benefits. First, it allows the model to generalize to sequences longer than those seen during training. Since relative positions are based on distance rather than absolute position, a model trained on sequences of length 512 can naturally handle sequences of length 1024 or longer, as long as the relative distances between tokens remain meaningful.

The relative positional encoding is incorporated into the attention mechanism itself. In standard transformers, attention scores between query position $i$ and key position $j$ are computed as $\text{score}_{i,j} = \mathbf{q}_i^\top \mathbf{k}_j$, where queries and keys already include absolute positional information. In Transformer-XL, the attention scores are modified to explicitly incorporate relative position information. The computation becomes more involved but enables the model to learn how relative positions should influence attention, rather than relying on fixed absolute positional encodings.

The relative positional encoding scheme encodes only the distance between positions. For a relative distance $r = i - j$ between a query at position $i$ and a key at position $j$, the model uses an encoding $\mathbf{R}_r$ (sinusoidal in the original formulation) together with learnable projection and global bias terms that capture how this distance should influence attention. These terms enter the attention score alongside the content-based terms, so the score between two positions depends on their content and on how far apart they are, but not on where they sit in the sequence absolutely.
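
Written out, the attention score between query position $i$ and key position $j$ decomposes into four terms in the paper's formulation, where $\mathbf{E}_{x_i}$ is the embedding of token $x_i$, $\mathbf{R}_{i-j}$ is the relative position encoding, and $\mathbf{u}, \mathbf{v}$ are the learnable global bias vectors:

$$
\mathbf{A}_{i,j}^{\text{rel}} =
\underbrace{\mathbf{E}_{x_i}^\top \mathbf{W}_q^\top \mathbf{W}_{k,E}\, \mathbf{E}_{x_j}}_{\text{(a) content}}
+ \underbrace{\mathbf{E}_{x_i}^\top \mathbf{W}_q^\top \mathbf{W}_{k,R}\, \mathbf{R}_{i-j}}_{\text{(b) content-dependent position bias}}
+ \underbrace{\mathbf{u}^\top \mathbf{W}_{k,E}\, \mathbf{E}_{x_j}}_{\text{(c) global content bias}}
+ \underbrace{\mathbf{v}^\top \mathbf{W}_{k,R}\, \mathbf{R}_{i-j}}_{\text{(d) global position bias}}
$$

Terms (a) and (c) depend on the key's content, while terms (b) and (d) depend only on the relative distance $i - j$, which is what allows the same parameters to be applied at any absolute position.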

This design makes the attention mechanism position-aware in a relative sense. The model learns that tokens that are close together (small $|r|$) should typically have stronger attention weights than tokens that are far apart (large $|r|$), but it can also learn exceptions to this pattern when content similarity matters more than distance. The relative encoding scheme is flexible enough to handle sequences of varying lengths, as it focuses on the relationships between positions rather than their absolute locations.

The combination of segment-level recurrence and relative positional encodings creates a powerful architecture for handling long sequences. During training, the model processes documents in segments, but information flows between segments through the recurrent connections. During inference, the model can process sequences of arbitrary length by maintaining a cache of previous segment hidden states. This cache grows as the sequence lengthens, allowing the model to maintain context across very long sequences while still processing in manageable segments.

The architecture also includes several implementation details that improve efficiency. The hidden states from previous segments are cached and reused, avoiding redundant computation. When processing a new segment, the model only computes attention over the current segment and the cached previous segment, maintaining manageable computational cost. The caching mechanism allows the model to effectively have a context window that is a multiple of the segment length, with the multiplier determined by how many previous segments are cached.
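
At inference time, the cache can be managed as a simple rolling buffer. The sketch below uses hypothetical helper names and a stand-in for the transformer layer; it keeps only the most recent `mem_len` hidden states and feeds them as memory for the next segment:

```python
import torch

def update_memory(old_mem, new_hidden, mem_len):
    """Append freshly computed hidden states to the cache and keep only
    the most recent `mem_len` positions (done per layer in practice)."""
    with torch.no_grad():
        combined = torch.cat([old_mem, new_hidden], dim=0)    # (M + L, d)
        return combined[-mem_len:]                            # at most mem_len rows

d_model, seg_len, mem_len = 64, 4, 8
memory = torch.zeros(0, d_model)                              # start with an empty cache
for segment in torch.randn(5, seg_len, d_model):              # five dummy segments
    # hidden = layer(segment, memory)  <- attention over memory + current segment
    hidden = segment                                          # stand-in for the layer output
    memory = update_memory(memory, hidden, mem_len)
print(memory.shape)  # torch.Size([8, 64]): a bounded window of recent context
```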

Transformer-XL demonstrated that these architectural modifications could dramatically improve performance on long-sequence tasks while maintaining computational efficiency. The model achieved better perplexity scores on language modeling benchmarks, particularly on datasets with long-range dependencies. More importantly, it showed that transformers could be extended to handle longer contexts through careful architectural design, opening the door for subsequent innovations in long-context modeling.

Applications and Impact

Transformer-XL achieved immediate success on language modeling benchmarks, demonstrating superior performance on datasets that contained long-range dependencies. The model set new state-of-the-art results on several standard language modeling benchmarks, including WikiText-103 and the One Billion Word dataset. These improvements were particularly pronounced on tasks that required maintaining coherence across long contexts, where the segment-level recurrence mechanism provided significant advantages over standard transformers.

The architecture's ability to handle longer sequences made it valuable for several specific applications. In document-level understanding tasks, Transformer-XL could maintain context across entire documents rather than being limited to short segments. This capability was important for tasks like document classification, where understanding the full document context rather than just local patterns improved accuracy. The model could also better handle tasks like coreference resolution, where pronouns or references might appear many sentences apart.

In text generation tasks, Transformer-XL's extended context enabled more coherent long-form generation. Language models could maintain consistency across longer generated passages, as they had access to information from earlier in the sequence. This improvement was particularly noticeable in creative writing, technical documentation, or any generation task where maintaining coherence and avoiding repetition across long outputs was important.

The architecture also found applications in code modeling and generation. Programming languages often have long-range dependencies, such as when a function defined early in a file is called much later, or when class definitions span many lines. Transformer-XL's ability to maintain context across longer sequences made it better suited for code understanding and generation tasks than standard transformers with fixed-length context windows.

Perhaps more significant than these immediate applications was the influence Transformer-XL had on subsequent architecture development. The relative positional encoding scheme proved to be a particularly influential innovation. Many subsequent transformer variants adopted relative or modified positional encoding schemes, recognizing that absolute positional encodings were limiting. The idea that positional information could be incorporated more flexibly became a common theme in later architectures.

The segment-level recurrence mechanism, while specific to Transformer-XL's design, demonstrated the general principle that transformers could maintain longer contexts through architectural modifications. This principle influenced the development of other long-context transformer variants. Some subsequent models used sliding window attention, where each position could attend to a fixed window of previous positions, effectively implementing a form of recurrence through attention patterns. Others used memory mechanisms that stored summaries of previous segments, creating hierarchical approaches to long-context modeling.

Transformer-XL's success also highlighted the importance of efficient long-context modeling. As language models grew larger and were applied to more diverse tasks, the ability to handle longer sequences became increasingly valuable. The architecture's demonstration that context could be extended while maintaining efficiency influenced research priorities, encouraging further work on long-context transformers that could scale to even longer sequences.

The relative positional encoding scheme has had particularly lasting impact. Modern language models such as PaLM and LLaMA use related schemes like Rotary Position Embedding (RoPE), which also encodes relative rather than absolute positions. The insight that relative positional relationships are more generalizable than absolute positions has become standard practice in transformer design.

The architecture also contributed to understanding how transformers could be adapted for specific sequence length requirements. Different applications have different optimal context lengths, and Transformer-XL demonstrated that architectural modifications could flexibly extend context without requiring complete retraining. This flexibility has become important as language models are applied to diverse tasks with varying context requirements, from short conversations to long documents.

Limitations

While Transformer-XL addressed significant limitations of standard transformers, it also introduced some constraints and remained subject to certain fundamental challenges. The segment-level recurrence mechanism, while effective, requires maintaining cached hidden states from previous segments. This caching increases memory requirements, particularly when processing very long sequences or when maintaining caches across many segments. The memory cost grows linearly with the number of cached segments, which can become limiting for extremely long sequences.

The computational efficiency gains from segment-level recurrence are real but not unlimited. Processing each segment keeps the per-step cost manageable, but attending over cached segments still adds computation. When many segments are cached, the attention mechanism must compute scores between every position in the current segment and every cached position, which increases computation time. This means that while Transformer-XL extends context length more efficiently than standard transformers, there are still practical limits to how long sequences can be processed efficiently.

The relative positional encoding scheme, while more flexible than absolute encodings, still has limitations. The model must learn embeddings for relative distances, and if the training data doesn't contain examples of certain relative distances, the model may not handle them well. Very long relative distances, beyond those seen frequently during training, may not be well-represented. Additionally, the relative encoding scheme assumes that positional relationships are primarily determined by distance, which may not always be the case for all types of sequences or tasks.

The architecture's design assumes that sequences can be naturally segmented, but some types of data may not have clear segment boundaries. In these cases, the segmentation strategy becomes important and may require careful design. The model's performance can be sensitive to how sequences are divided into segments, which adds an additional consideration when applying the architecture to new domains or tasks.

Transformer-XL's improvements are most pronounced on tasks with long-range dependencies, but many tasks don't require such dependencies. For tasks where local context is sufficient, the additional complexity and computational cost of segment-level recurrence may not provide benefits that justify the overhead. The architecture is most valuable when long-range context is actually necessary for good performance.

The model still faces fundamental challenges related to attention over very long sequences. While segment-level recurrence extends effective context length, the quadratic scaling of attention computation remains a concern. Processing sequences that are tens of thousands of tokens long, even with efficient segment processing, requires significant computational resources. The architecture improves efficiency relative to standard transformers but doesn't eliminate the fundamental computational challenges of long-sequence processing.

Another limitation is that the cached hidden states from previous segments represent information at a fixed point in processing. As the model processes new segments, it may discover information that would change how previous segments should be interpreted, but the cached states don't update. This means the model's understanding of earlier segments remains static, which may limit its ability to revise interpretations based on later context. Some subsequent architectures have addressed this through mechanisms that allow updating cached representations.

The relative positional encoding scheme, while more flexible than absolute encodings, still embeds assumptions about how positions should relate. The learnable relative position embeddings capture patterns from training data, but these patterns may not generalize perfectly to all types of sequences or domains. Sequences with unusual structure or that violate assumptions about how positions relate may not benefit as much from the relative encoding approach.

Legacy and Looking Forward

Transformer-XL's innovations have had lasting influence on transformer architecture development. The relative positional encoding scheme, in particular, has become a standard component of many modern language models. The insight that encoding relative positions rather than absolute positions enables better generalization has been widely adopted. Modern models such as PaLM, LLaMA, and others use variants of relative positional encoding, including Rotary Position Embedding (RoPE) and other schemes that encode relative rather than absolute positional relationships.

The segment-level recurrence mechanism demonstrated that transformers could maintain longer contexts through architectural modifications without requiring complete architectural redesign. This principle has influenced subsequent work on long-context transformers. While many later architectures use different mechanisms for extending context, they share Transformer-XL's goal of enabling longer context while maintaining computational efficiency.

The architecture's success on language modeling tasks highlighted the importance of long-context modeling for natural language processing. As language models have grown larger and more capable, the ability to process longer sequences has become increasingly valuable. Many practical applications require understanding context that spans thousands or tens of thousands of tokens, from analyzing long documents to maintaining coherence in extended conversations. Transformer-XL helped establish long-context modeling as an important research direction.

The architecture also contributed to understanding how transformers could be adapted for specific requirements. Different applications have different optimal context lengths and computational constraints, and Transformer-XL demonstrated that architectural modifications could extend context flexibly. This adaptability has become important as transformers are applied to increasingly diverse tasks, each with its own context length requirements.

Subsequent research has built on Transformer-XL's ideas while addressing some of its limitations. Models like Longformer use sliding window attention to extend context, implementing a form of efficient long-range attention. Sparse attention mechanisms, as used in models like BigBird, create patterns of attention that maintain long-range connections while reducing computation. These approaches share Transformer-XL's goal of extending context length but use different mechanisms.

The relative positional encoding innovation has been particularly influential. Research has explored many variations on this theme, from RoPE's rotary embeddings to Attention with Linear Biases (ALiBi), which applies fixed, head-specific linear biases based on relative distance. These developments show that the core insight of relative positional encoding has been widely recognized and adapted across the field.

Looking forward, the challenges of long-context modeling remain active areas of research. As models scale to handle even longer sequences, efficiency becomes increasingly important. Research continues into attention mechanisms that can handle very long sequences efficiently, from sparse attention patterns to hierarchical architectures that summarize long contexts. Transformer-XL's contributions to this area continue to influence ongoing work.

The architecture's demonstration that transformers could be extended to longer contexts has also influenced how language models are evaluated and applied. Benchmark datasets and tasks increasingly include longer contexts, recognizing that many practical applications require understanding extended sequences. The field's focus on long-context capabilities has grown, partly because Transformer-XL showed that such capabilities were achievable.

Transformer-XL's introduction in 2019 represents an important moment in transformer architecture evolution. By addressing the fundamental limitations of fixed context length and segment independence, the architecture opened new possibilities for applying transformers to tasks requiring long-range understanding. The innovations it introduced, particularly relative positional encoding, have become standard components of modern language models. While subsequent architectures have built on and extended these ideas, Transformer-XL's contributions to enabling longer context in transformers have had lasting impact on the field.

Quiz

Ready to test your understanding of Transformer-XL? Challenge yourself with these questions about this important architectural innovation that extended transformers to handle longer sequences, and see how well you've grasped the key concepts that made long-context modeling possible. Good luck!



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
