Explore how XLNet, RoBERTa, and ALBERT refined BERT through permutation language modeling, optimized training procedures, and architectural efficiency. Learn about bidirectional autoregressive pretraining, dynamic masking, and parameter sharing innovations that advanced transformer language models.

This article is part of the free-to-read History of Language AI book
2019: XLNet, RoBERTa, ALBERT
BERT's success in 2018 had demonstrated the power of pretrained bidirectional encoders, but the rapid pace of research meant that improvements were already being developed even as BERT gained widespread adoption. Throughout 2019, three major refinements to BERT emerged, each addressing different limitations and pushing the boundaries of what pretrained transformers could achieve. These developments reflected a maturing understanding of pretraining objectives, training procedures, and architectural efficiency in language models.
XLNet, developed by researchers at Carnegie Mellon University and Google Brain led by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, introduced a fundamentally different pretraining approach that sought to overcome BERT's masked language modeling limitations through permutation language modeling. RoBERTa, from Facebook AI Research led by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, demonstrated that BERT's architecture was sound but its training procedure could be significantly improved. ALBERT, from Google Research led by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut, focused on making BERT more parameter-efficient through architectural innovations that enabled larger models to be trained with the same computational resources.
These three models represented different philosophies in improving pretrained language models. XLNet questioned whether masked language modeling was the optimal pretraining objective, proposing instead an autoregressive approach that could capture bidirectional context without the artificial masking tokens. RoBERTa maintained BERT's architecture but systematically optimized every aspect of its training, showing that methodological improvements could yield substantial gains. ALBERT rethought the architecture itself, finding ways to reduce parameter count while maintaining or improving performance, enabling training of larger models within fixed computational budgets.
The significance of these three models extended beyond their individual contributions. Together, they demonstrated that the pretrained transformer paradigm had considerable headroom for improvement, and that multiple paths toward better language understanding were viable. XLNet showed that pretraining objectives mattered deeply, RoBERTa showed that training methodology was crucial, and ALBERT showed that architectural efficiency could unlock new capabilities. These insights would inform the development of subsequent models like T5, GPT-3, and the many transformer variants that followed.
The Problem
Despite BERT's impressive success, several limitations became apparent as researchers deployed it widely and analyzed its behavior more carefully. These limitations fell into three categories: fundamental issues with the masked language modeling objective, suboptimal training procedures, and inefficient parameter usage that constrained model scale.
Limitations of Masked Language Modeling
BERT's masked language modeling approach, while effective, introduced artifacts that limited its ability to learn optimal representations. At the positions it was asked to predict, the model mostly saw an artificial [MASK] token rather than the actual word; during fine-tuning, however, [MASK] never appeared in the input, creating a mismatch between pretraining and downstream tasks. This pretrain-finetune discrepancy meant that representations learned during pretraining might not transfer optimally to real inputs.
Masked language modeling also suffered from independence assumptions. When predicting multiple masked tokens in the same sequence, BERT predicted each token independently, even though these tokens appeared in the same context. In reality, masked tokens were conditionally dependent on each other. For example, in "[MASK] [MASK] is a city", a natural completion is "New York", and the two predictions need to be consistent with each other, but BERT's parallel prediction meant it could not model the dependency between simultaneously masked tokens.
The bidirectional context captured by BERT came at the cost of not learning a generative model. BERT could not generate text autoregressively because it was trained with bidirectional attention. This limitation mattered for tasks requiring generation, such as text summarization or dialogue systems, where autoregressive capabilities were essential. The inability to generate text meant BERT was limited to understanding tasks rather than both understanding and generation.
Suboptimal Training Procedures
BERT's original training procedure, while sufficient to demonstrate the model's potential, left room for improvement in several areas. The model was trained for relatively few steps on a dataset that, while large, could be expanded significantly. The training used static masking, where the same masking pattern was applied to each sequence during multiple epochs, potentially leading to overfitting to specific masking patterns.
The next sentence prediction (NSP) task, intended to capture sentence-level relationships, showed mixed results. Some analyses suggested that NSP was too easy and provided limited signal. The binary classification task of predicting whether two sentences were consecutive or not might not have been the most effective way to learn sentence relationships. Some experiments showed that removing NSP entirely could improve performance, suggesting the task was not as valuable as initially thought.
BERT's training hyperparameters had been chosen conservatively. The learning rate schedule, batch size, and other training details had been selected to ensure stability, but might not have been optimal for maximizing performance. As computational resources became more available and understanding of transformer training deepened, more aggressive training regimes became feasible.
Parameter Inefficiency
BERT's architecture, while powerful, used parameters inefficiently in several ways. The model stored separate token, segment, and position embedding matrices, with the vocabulary size and hidden dimension determining the bulk of the embedding parameters. For large models, these embedding parameters consumed significant memory but might not have needed the full hidden dimension size.
The transformer layers themselves consumed most of the parameters, but researchers questioned whether every layer needed its own independent weights. If different layers learned broadly similar transformations, parameter sharing might enable learning more general patterns with fewer parameters. Additionally, the attention mechanism and feedforward networks scaled quadratically and linearly with sequence length respectively, making longer sequences computationally expensive.
The parameter inefficiency meant that training larger BERT models required proportionally more computational resources. Doubling the model size roughly doubled the computational cost, limiting the scale of models that could be practically trained. For a fixed computational budget, this inefficiency prevented exploring larger model architectures that might have captured more complex patterns.
The Solutions
XLNet: Permutation Language Modeling
XLNet addressed the masked language modeling limitations by introducing a generalized autoregressive pretraining approach that could capture bidirectional context without masking. The key insight was that autoregressive models could be made bidirectional by considering all possible factorization orders of the input sequence.
Permutation Language Modeling
Instead of masking tokens, XLNet used permutation language modeling. For a sequence of length $T$, there are $T!$ possible permutation orders. XLNet sampled a permutation order during training and predicted tokens in that order. This approach maintained the autoregressive property needed for valid probability distributions while allowing the model to see both left and right context for any given position.
For example, given the six-token sequence "The capital of France is Paris", XLNet might sample the permutation order (3, 1, 5, 2, 6, 4), with positions indexed from 1. The model would then predict position 3 first (with no context), then position 1 (conditioning on position 3), then position 5 (conditioning on positions 3 and 1), and so on through the rest of the order. Because each training example used a freshly sampled permutation, every position was eventually predicted from many different subsets of its left and right neighbors, so the model learned representations that captured bidirectional context.
The permutation approach solved several problems. First, it eliminated the pretrain-finetune discrepancy because the model always saw actual tokens, never masked placeholders. Second, it naturally captured dependencies between tokens because the autoregressive structure modeled conditional distributions. Third, it maintained the ability to be generative while capturing bidirectional context, unlike BERT's purely bidirectional approach.
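To make the factorization idea concrete, the sketch below (illustrative only, not XLNet's implementation; the function name and tensor layout are assumptions) samples a random factorization order and derives which positions each prediction is allowed to condition on.

```python
import torch

def permutation_context_mask(seq_len: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample a factorization order and derive a visibility mask.

    mask[i, j] is True when the prediction at position i may condition on
    the content of position j, i.e. when j comes earlier in the sampled order.
    """
    order = torch.randperm(seq_len)               # e.g. tensor([2, 0, 4, 1, 5, 3])
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)           # rank[pos] = step at which pos is predicted
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)  # i may see j iff j is predicted earlier
    return order, mask

order, mask = permutation_context_mask(6)
print("factorization order:", order.tolist())
print(mask.int())  # row i marks the positions that prediction i may condition on
```

Averaging the training loss over many such sampled orders is what gives every position access, in expectation, to context on both sides.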
Two-Stream Self-Attention
To implement permutation language modeling efficiently, XLNet introduced two-stream self-attention. Each position carried two representations: a query stream, used when predicting that position and therefore blind to its own content, and a content stream, which encoded the token itself so that later predictions in the permutation could use it as context. This dual-stream architecture allowed the model to properly implement the autoregressive structure while maintaining computational efficiency comparable to BERT.
The content stream used both content and position embeddings, allowing tokens to access their own content and position information. The query stream used only position embeddings, ensuring that when predicting a token, the model could not directly access that token's content. This structure enforced the autoregressive constraint while allowing bidirectional context through the permutation mechanism.
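Seen purely in terms of attention masks, the two streams differ in a single detail: the content stream lets a position attend to its own token, while the query stream does not. A hedged sketch of that distinction, reusing the order-to-rank idea from the previous example:

```python
import torch

def two_stream_masks(order: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Derive content-stream and query-stream visibility masks from a
    sampled factorization order (illustrative, not XLNet's actual code)."""
    seq_len = order.numel()
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)
    earlier = rank.unsqueeze(1) > rank.unsqueeze(0)                # j strictly earlier than i
    content_mask = earlier | torch.eye(seq_len, dtype=torch.bool)  # content stream: may see itself
    query_mask = earlier                                           # query stream: must not see itself
    return content_mask, query_mask

content_mask, query_mask = two_stream_masks(torch.randperm(6))
print((content_mask ^ query_mask).int())  # the two masks differ exactly on the diagonal
```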
Relative Positional Encoding
XLNet also used relative positional encodings, similar to those introduced in Transformer-XL. Rather than encoding absolute positions, the model encoded relative distances between query and key positions. This approach improved generalization to sequences longer than those seen during training and better captured positional relationships.
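As a rough illustration of the relative-encoding idea, the sketch below implements a simplified learned bias over clipped relative distances. This is not the Transformer-XL or XLNet formulation, which folds sinusoidal relative encodings directly into the attention score computation; the class name and parameters here are illustrative.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned bias indexed by the clipped relative distance between query
    and key positions. A simplified stand-in for relative encodings."""
    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        q_pos = torch.arange(q_len).unsqueeze(1)   # (q_len, 1)
        k_pos = torch.arange(k_len).unsqueeze(0)   # (1, k_len)
        rel = (k_pos - q_pos).clamp(-self.max_distance, self.max_distance)
        return self.bias(rel + self.max_distance).permute(2, 0, 1)  # (heads, q_len, k_len)

bias = RelativePositionBias(num_heads=12)
print(bias(64, 64).shape)  # torch.Size([12, 64, 64]), added to attention scores before softmax
```

Because only distances matter, the same bias table covers any sequence length up to the clipping limit, which is what helps generalization to longer inputs.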
RoBERTa: Optimized Training
RoBERTa maintained BERT's architecture but systematically improved its training procedure, demonstrating that better training methodology could yield substantial improvements without architectural changes.
Training Data and Duration
RoBERTa used significantly more training data than BERT. Where BERT had trained on BooksCorpus and English Wikipedia totaling about 16GB, RoBERTa expanded to include CommonCrawl news data, web text, and stories, totaling about 160GB. This tenfold increase in training data provided more diverse examples and improved generalization.
The model also consumed far more compute during pretraining. BERT had trained for 1M steps with batches of 256 sequences; RoBERTa used much larger batches of around 8,000 sequences and trained for up to 500K steps, processing many times more text overall. The combination of more data and longer, larger-batch training allowed the model to converge fully and extract maximum benefit from the expanded dataset.
Dynamic Masking
RoBERTa replaced BERT's static masking with dynamic masking. Instead of applying the same masking pattern to each sequence across epochs, RoBERTa generated a new masking pattern each time a sequence was processed. This prevented the model from overfitting to specific masking patterns and encouraged learning more robust representations.
Dynamic masking was straightforward to implement but had a significant impact. Because each pass over a sequence hid a different subset of words, the model could not memorize particular mask positions and had to learn representations that held up under varied contexts, which better matched the variability of real-world text.
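A minimal sketch of the difference, assuming a simple token-level masking function (illustrative names): the masking decision is made each time a sequence is fed to the model rather than once during preprocessing. The real recipe also replaces some selected tokens with random tokens or leaves them unchanged (the 80/10/10 rule), which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"

def dynamically_mask(tokens: list[str], mask_prob: float = 0.15) -> tuple[list[str], list[int]]:
    """Pick a fresh set of positions to mask every time a sequence is seen."""
    masked = list(tokens)
    target_positions = []
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            masked[i] = MASK_TOKEN      # hide this token for the current pass only
            target_positions.append(i)  # remember which positions to predict
    return masked, target_positions

tokens = "the capital of france is paris".split()
print(dynamically_mask(tokens))  # different mask positions on every call
print(dynamically_mask(tokens))
```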
Removing Next Sentence Prediction
RoBERTa removed the next sentence prediction task entirely. Experiments showed that NSP was not necessary and might even hurt performance. Instead, the model focused solely on masked language modeling, using longer sequences and more training data to learn sentence-level relationships implicitly.
The removal of NSP simplified the training procedure and allowed the model to allocate more capacity to learning word-level and sequence-level patterns through masked language modeling alone. The results validated that masked language modeling was sufficient to learn high-quality representations.
Optimized Hyperparameters
RoBERTa used carefully tuned hyperparameters optimized for the larger dataset and longer training. The learning rate schedule, batch size, and other training details were adjusted based on empirical findings. While these changes were incremental, their cumulative effect contributed to improved performance.
ALBERT: Parameter Efficiency
ALBERT redesigned BERT's architecture to use parameters more efficiently, enabling training of larger models within fixed computational budgets.
Factorized Embedding Parameterization
ALBERT separated the vocabulary embedding size from the hidden dimension size. In BERT, these were the same, meaning embedding parameters scaled as $V \times H$, where $V$ is the vocabulary size and $H$ is the hidden dimension. ALBERT introduced an intermediate embedding dimension $E$ smaller than $H$, creating embeddings of size $V \times E$ that were then projected up to size $H$ through a linear transformation.
This factorization reduced embedding parameters from $O(V \times H)$ to $O(V \times E + E \times H)$. For typical values such as a 30,000-token vocabulary with $H = 768$ and $E = 128$, this reduced embedding parameters substantially. The projection layer added only $E \times H$ parameters, which was much smaller than the savings from shrinking the embedding matrix.
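The savings are easy to verify with a short sketch. The numbers below assume ALBERT-style values (a 30,000-token vocabulary, hidden size 768, factorized embedding size 128); exact counts depend on the configuration.

```python
import torch.nn as nn

V, E, H = 30_000, 128, 768  # vocabulary size, factorized embedding size, hidden size

# BERT-style: a single V x H embedding matrix.
bert_style = nn.Embedding(V, H)

# ALBERT-style: a V x E embedding followed by an E x H projection.
albert_style = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))

count = lambda module: sum(p.numel() for p in module.parameters())
print(count(bert_style))    # 23,040,000 parameters
print(count(albert_style))  # 3,938,304 parameters (30,000*128 + 128*768)
```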
Cross-Layer Parameter Sharing
ALBERT shared parameters across all transformer layers, rather than having independent parameters for each layer. This meant that all layers learned the same transformations, dramatically reducing the parameter count. For a 12-layer model, this reduced parameters by roughly a factor of 12 for the layer-specific components.
Parameter sharing forced the model to learn more general transformations that worked at every depth of the network. While this constraint might seem limiting, experiments showed that shared-parameter models could achieve comparable performance to unshared models while using far fewer parameters. The single shared layer implicitly learned to handle information at different levels of abstraction depending on where in the stack it was applied.
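A compact way to see the effect is to apply one layer repeatedly instead of stacking independently parameterized layers. The sketch below uses PyTorch's built-in encoder layer for brevity and only approximates ALBERT's architecture; the parameter-count comparison is the point.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Reuse a single transformer layer at every depth, in the spirit of
    ALBERT's cross-layer sharing (not the actual ALBERT implementation)."""
    def __init__(self, hidden: int = 768, heads: int = 12, depth: int = 12):
        super().__init__()
        self.depth = depth
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.depth):  # the same weights are applied at every layer
            x = self.layer(x)
        return x

count = lambda m: sum(p.numel() for p in m.parameters())
shared = SharedLayerEncoder()
unshared = nn.TransformerEncoder(  # 12 independently parameterized copies of the same layer
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072, batch_first=True),
    num_layers=12,
)
print(count(shared), count(unshared))  # roughly a 12x gap in layer parameters
```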
Sentence Order Prediction
ALBERT replaced next sentence prediction with sentence order prediction (SOP). Instead of predicting whether two sentences were consecutive, SOP took two consecutive segments from the same document and predicted whether they appeared in their original order. This task proved more effective because its negative examples came from the same document, so the model could not fall back on topic cues; it had to learn sentence-level coherence rather than mere adjacency or topical similarity.
SOP worked by swapping the order of consecutive sentences half the time during training. The model learned to distinguish whether sentences appeared in their natural order or had been swapped. This objective better captured sentence-level relationships and improved downstream performance on tasks requiring sentence understanding.
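A minimal sketch of how such training pairs could be constructed, assuming a simple 50/50 swap of consecutive sentences; the function name and label convention are illustrative.

```python
import random

def make_sop_example(sent_a: str, sent_b: str) -> tuple[tuple[str, str], int]:
    """Turn two consecutive sentences into a sentence-order-prediction
    example. Label 1 = original order, 0 = swapped."""
    if random.random() < 0.5:
        return (sent_a, sent_b), 1  # positive: keep the natural order
    return (sent_b, sent_a), 0      # negative: swap the pair

pair, label = make_sop_example(
    "ALBERT shares parameters across its transformer layers.",
    "This sharing dramatically reduces the total parameter count.",
)
print(pair, label)
```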
Effects of Parameter Efficiency
These architectural innovations enabled ALBERT to train much larger models. While BERT-base had 110M parameters and BERT-large had 340M, ALBERT-base used only 12M parameters and ALBERT-large only 18M to reach performance in the neighborhood of their BERT counterparts, and the larger ALBERT-xlarge and ALBERT-xxlarge configurations surpassed BERT-large while still using fewer parameters than it. The extreme parameter efficiency meant that even larger ALBERT models could be trained within reasonable computational budgets.
Applications and Impact
These three models found immediate applications across diverse NLP tasks, each offering advantages for different use cases. XLNet's bidirectional autoregressive approach showed particular strength on tasks requiring both understanding and generation. RoBERTa's optimized training made it a strong default choice for many downstream tasks. ALBERT's efficiency enabled deployment in resource-constrained environments.
XLNet demonstrated state-of-the-art performance on many benchmarks when it was released, including GLUE, RACE, and SQuAD. The model's ability to capture bidirectional context without masking artifacts proved valuable for tasks requiring nuanced understanding. The permutation language modeling approach influenced subsequent research into pretraining objectives, though its computational complexity limited widespread adoption.
RoBERTa quickly became a popular choice for fine-tuning on downstream tasks, often serving as a strong baseline for new research. The model's straightforward architecture, combined with its improved training, made it accessible and effective. Many production systems adopted RoBERTa as their backbone model, and the training methodology improvements influenced how subsequent models were trained.
ALBERT's parameter efficiency made it valuable for applications with computational constraints. Mobile and edge devices could deploy ALBERT models that would be impossible with equivalently sized BERT models. The architectural innovations, particularly parameter sharing and factorized embeddings, influenced subsequent research into efficient transformers. The insight that parameter sharing could maintain performance while dramatically reducing parameters opened new possibilities for model scaling.
The three models collectively demonstrated that pretrained transformers had substantial room for improvement. XLNet showed that pretraining objectives could be fundamentally rethought. RoBERTa showed that training methodology improvements could yield large gains. ALBERT showed that architectural efficiency could enable new capabilities. These lessons informed the development of later models like T5, which unified generation and understanding tasks, and GPT-3, which scaled to unprecedented sizes.
Research applications found particular value in ALBERT's efficiency. The ability to train and experiment with larger models using fewer resources accelerated research progress. The parameter sharing techniques influenced work on efficient transformers, including models designed for specific hardware constraints or specialized applications.
Limitations
Despite their improvements, each model retained limitations and introduced new challenges. XLNet's permutation language modeling, while theoretically elegant, increased computational complexity compared to BERT. The need to sample permutations and maintain two-stream attention made training and inference more expensive. The model required careful implementation to achieve the promised benefits, and the complexity limited its adoption compared to simpler alternatives.
RoBERTa's improvements came primarily from more data and longer training, which required substantial computational resources. The tenfold increase in training data meant that reproducing RoBERTa required access to large datasets and significant compute. While the methodology improvements were valuable, they didn't fundamentally change the model's capabilities or address the architectural limitations it inherited from BERT.
ALBERT's parameter sharing, while efficient, might have limited the model's representational capacity. All layers learning the same transformations could constrain the model's ability to learn hierarchical representations. Some analysis suggested that ALBERT models might require more training to converge due to the parameter sharing constraint, partially offsetting the efficiency gains.
All three models remained focused on understanding tasks, with encoder-oriented designs that limited their ability to generate text naturally. XLNet's autoregressive structure theoretically enabled generation, but the permutation mechanism made generation more complex than in purely autoregressive models.
The models also shared limitations with transformer architectures more broadly. They struggled with very long sequences due to quadratic attention complexity. They required substantial computational resources for training, even with ALBERT's efficiency improvements. They couldn't easily incorporate new information after training without fine-tuning.
Legacy
XLNet, RoBERTa, and ALBERT collectively demonstrated that the pretrained transformer paradigm was far from mature in 2019. Their different approaches showed that multiple paths toward improvement were viable, and that substantial gains could come from rethinking objectives, optimizing training, or redesigning architectures.
RoBERTa's training methodology improvements became standard practice for training transformer models. The emphasis on more data, longer training, dynamic masking, and careful hyperparameter tuning informed how subsequent models were trained. The insight that training methodology could yield large improvements without architectural changes influenced research priorities, leading to more systematic investigation of training procedures.
ALBERT's parameter efficiency techniques influenced subsequent research into efficient transformers. The ideas of parameter sharing, factorized embeddings, and architectural redesign for efficiency appeared in many later models. The extreme efficiency demonstrated by ALBERT showed that large models weren't always necessary, and that careful architecture design could achieve strong performance with fewer resources.
XLNet's permutation language modeling, while less widely adopted, influenced thinking about pretraining objectives. The insight that autoregressive models could be made bidirectional through permutation opened new directions for research. The idea of using all possible factorization orders, while computationally expensive, showed that pretraining objectives could be fundamentally rethought.
The three models also highlighted the importance of systematic evaluation and ablation studies. RoBERTa's careful analysis of training components showed how important it was to understand which aspects of training mattered most. ALBERT's architectural experiments demonstrated how systematic design choices could achieve efficiency. These methodological contributions influenced how subsequent models were developed and evaluated.
The architectural innovations, particularly from ALBERT, found applications in resource-constrained environments. Mobile devices, edge computing, and specialized hardware could deploy efficient transformer models enabled by these techniques. The parameter sharing and factorization ideas appeared in many later efficient transformer variants designed for specific deployment scenarios.
The collective impact of these three models was to show that pretrained transformers were still rapidly evolving. BERT had demonstrated the paradigm's viability, but XLNet, RoBERTa, and ALBERT showed that substantial improvements were still possible. This insight encouraged continued research into transformer architectures, training methods, and efficiency techniques, leading to the explosion of transformer variants that followed.
Looking forward, these models established patterns that would persist in subsequent developments. The emphasis on training methodology from RoBERTa would continue in later models. The efficiency focus from ALBERT would become increasingly important as models scaled. The rethinking of pretraining objectives from XLNet would influence unified models like T5 that sought to handle multiple tasks with single architectures. The collective lessons from these three models informed the next generation of language models that would achieve even greater capabilities.