
For engineers, researchers, students, AI enthusiasts, linguists, product managers, and anyone interested in understanding or building modern language AI systems, from foundational NLP to advanced large language models.
Language AI Handbook
A Complete Guide to Natural Language Processing and Large Language Models: From Classical NLP and Transformer Architecture to Pre-training, Fine-tuning, and Production Deployment
About This Book
Language AI has transformed from an academic curiosity into the defining technology of our era. But beneath the hype of ChatGPT and Claude lies a rich technical landscape that most practitioners only partially understand. This handbook gives you the complete picture, from classical NLP techniques that still matter to the cutting-edge architectures powering today's most capable systems.
Begin with the fundamentals that never go out of style: tokenization, embeddings, and the statistical foundations that inform modern approaches. Then dive deep into the transformer architecture. Learn not just how to use it, but how it actually works. Understand self-attention mathematically, grasp why positional encodings matter, and see how architectural choices like layer normalization affect training dynamics.
Table of Contents
Part I: Text as Data
5 chapters
Character Encoding
Covers ASCII origins and 7-bit limitations, Unicode code points and planes, UTF-8 variable-width encoding scheme, byte order marks and endianness, encoding detection heuristics, common encoding errors and mojibake, practical encoding/decoding in Python.
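To make this concrete before you dive in, here is a minimal sketch (standard library only, toy string invented for the example) of the round-trip and the mojibake failure mode this chapter dissects:

```python
text = "naïve café"

encoded = text.encode("utf-8")      # b'na\xc3\xafve caf\xc3\xa9'
decoded = encoded.decode("utf-8")   # round-trips back to the original string
assert decoded == text

# Mojibake: the same UTF-8 bytes misread as Latin-1.
print(encoded.decode("latin-1"))    # naÃ¯ve cafÃ©
```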
Text Normalization
Covers Unicode normalization forms (NFC, NFD, NFKC, NFKD), case folding vs lowercasing, accent and diacritic handling, whitespace normalization, ligature expansion, full-width to half-width conversion, implementing a normalization pipeline.
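A small preview of the normalization pipeline, using only the standard library's unicodedata module (the sample string is illustrative):

```python
import unicodedata

s = "ﬁne ①"                                # ligature 'fi' and a circled digit
print(unicodedata.normalize("NFC", s))     # unchanged: NFC only composes code points
print(unicodedata.normalize("NFKC", s))    # 'fine 1': compatibility mappings applied

print("Straße".lower())                    # 'straße'  — lowercasing
print("Straße".casefold())                 # 'strasse' — case folding goes further
```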
Regular Expressions
Covers regex syntax and metacharacters, character classes and quantifiers, grouping and backreferences, lookahead and lookbehind assertions, greedy vs lazy matching, common NLP patterns (emails, URLs, dates), regex performance considerations.
Sentence Segmentation
Covers period disambiguation challenges, abbreviation handling, rule-based boundary detection, Punkt sentence tokenizer algorithm, evaluation metrics for segmentation, handling edge cases (quotes, parentheses, lists), multilingual segmentation issues.
Word Tokenization
Covers whitespace tokenization limitations, punctuation handling rules, contractions and clitics, language-specific challenges (Chinese, Japanese, German compounds), Penn Treebank tokenization standard, building a rule-based tokenizer, tokenization evaluation.
Part II: Classical Text Representations
9 chapters
Bag of Words
Covers document-term matrix construction, vocabulary building from corpus, word counting and frequency vectors, sparse matrix representation (CSR/CSC formats), vocabulary pruning (min_df, max_df), binary vs count representations, limitations of word order loss.
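As a taste of the chapter (assuming scikit-learn is installed; the two documents are toy examples), the core ideas fit in a few lines:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer(min_df=1)           # vocabulary pruning via min_df/max_df
X = vec.fit_transform(docs)               # scipy CSR sparse matrix
print(vec.get_feature_names_out())        # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())                        # [[1 0 0 1 1], [1 1 1 1 2]]
```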
N-grams
Covers bigram and trigram extraction, n-gram vocabulary explosion, n-gram frequency distributions, Zipf's law in n-grams, character n-grams for robustness, skip-grams and flexible windows, n-gram indexing for search.
N-gram Language Models
Covers Markov assumption and chain rule, maximum likelihood estimation, probability calculation for sequences, handling unseen n-grams, start and end tokens, generating text from n-gram models, model storage and lookup efficiency.
Smoothing Techniques
Covers add-one (Laplace) smoothing, add-k smoothing and tuning, Good-Turing smoothing derivation, Kneser-Ney smoothing intuition and formula, interpolation vs backoff, modified Kneser-Ney, comparing smoothing methods empirically.
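A minimal sketch of add-k smoothing on bigram counts; the corpus here is a toy sentence, not real training data:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)                          # vocabulary size

def smoothed_bigram_prob(w1, w2, k=1.0):
    # Add-k smoothing: (count(w1, w2) + k) / (count(w1) + k * V)
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * V)

print(smoothed_bigram_prob("the", "cat"))  # seen bigram
print(smoothed_bigram_prob("the", "dog"))  # unseen bigram: nonzero thanks to smoothing
```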
Perplexity
Covers cross-entropy definition and derivation, perplexity as branching factor, relationship to bits-per-character, held-out evaluation methodology, perplexity vs downstream performance, comparing models with perplexity, perplexity limitations and caveats.
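A toy illustration of the relationship between cross-entropy and perplexity; the per-token probabilities are made up for the example:

```python
import math

# Perplexity = exp(average negative log-likelihood) over a held-out sequence.
probs = [0.2, 0.5, 0.1, 0.4]
nll = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(nll)
print(f"cross-entropy: {nll:.3f} nats, perplexity: {perplexity:.2f}")
```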
Term Frequency
Covers raw term frequency, log-scaled term frequency, boolean term frequency, augmented term frequency, L2-normalized frequency vectors, term frequency sparsity patterns, efficient term frequency computation.
Inverse Document Frequency
Covers document frequency calculation, IDF formula derivation, IDF intuition (rare words matter more), smoothed IDF variants, IDF across corpus splits, relationship to information theory, implementing IDF efficiently.
TF-IDF
Covers TF-IDF formula and variants, TF-IDF vector computation, TF-IDF normalization options, BM25 as TF-IDF extension, document similarity with TF-IDF, TF-IDF for feature extraction, sklearn TfidfVectorizer deep dive.
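A quick preview with scikit-learn's TfidfVectorizer on toy documents, using its defaults (smoothed IDF, L2 normalization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(cosine_similarity(X[0], X[1]))      # document similarity with TF-IDF
```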
BM25
Covers BM25 derivation from probabilistic IR, saturation parameter k1, length normalization parameter b, BM25+ and BM25L variants, field-weighted BM25, implementing BM25 scoring, BM25 vs TF-IDF empirically.
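A from-scratch sketch of the BM25 scoring function with the usual k1 and b parameters; a three-document toy corpus stands in for a real index:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document for a query; corpus is a list of tokenized docs."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # BM25 IDF with a +1 floor
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

corpus = [d.split() for d in ["the cat sat", "dogs chase cats", "the mat"]]
print(bm25_score("cat sat".split(), corpus[0], corpus))
```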
Part III: Distributional Semantics
4 chapters
The Distributional Hypothesis
Covers Firth's "you shall know a word by the company it keeps," distributional similarity intuition, context window definitions, paradigmatic vs syntagmatic relations, word similarity from distributions, limitations of distributional semantics.
Co-occurrence Matrices
Covers word-word co-occurrence matrices, word-document matrices, context window size effects, weighting by distance, symmetric vs directional contexts, matrix sparsity patterns, efficient construction algorithms.
Pointwise Mutual Information
Covers PMI formula derivation, PMI interpretation as association, positive PMI (PPMI), shifted PPMI variants, PMI matrix properties, PMI vs raw counts comparison, PMI for collocation extraction.
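A minimal PPMI computation over a made-up co-occurrence matrix, using NumPy:

```python
import numpy as np

# Toy word-word co-occurrence counts (rows and columns share one vocabulary).
C = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1, keepdims=True) / total
p_j = C.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_i * p_j))
ppmi = np.maximum(pmi, 0)   # PPMI: clip negative associations to zero
print(ppmi)
```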
Singular Value Decomposition
Covers SVD mathematical formulation, truncated SVD for dimensionality reduction, LSA (Latent Semantic Analysis), choosing embedding dimensions, SVD computational complexity, randomized SVD for scale, interpreting SVD dimensions.
Part IV: Word Embeddings
9 chapters
Skip-gram Model
Covers skip-gram architecture diagram, input/output representations, softmax over vocabulary, skip-gram objective function, training data generation, window size hyperparameter, skip-gram vs CBOW intuition.
CBOW Model
Covers CBOW architecture, context word averaging, CBOW objective function, CBOW vs skip-gram training speed, CBOW for frequent words, implementing CBOW forward pass, CBOW gradient derivation.
Negative Sampling
Covers softmax computational bottleneck, negative sampling objective derivation, sampling distribution (unigram^0.75), number of negatives hyperparameter, negative sampling gradient computation, NCE vs negative sampling, implementing efficient sampling.
Hierarchical Softmax
Covers binary tree construction (Huffman coding), path probability computation, hierarchical softmax objective, gradient computation along paths, tree structure impact on learning, hierarchical softmax vs negative sampling, when to use each approach.
Word2Vec Training
Covers data preprocessing pipeline, subsampling frequent words, learning rate scheduling, minibatch vs online training, convergence monitoring, gensim Word2Vec usage, training from scratch in PyTorch.
Word Analogy
Covers vector arithmetic for analogies, parallelogram model, analogy evaluation datasets, 3CosAdd vs 3CosMul methods, analogy accuracy metrics, limitations of analogy evaluation, what analogies reveal about embeddings.
GloVe
Covers GloVe objective function derivation, weighted least squares formulation, relationship to matrix factorization, weighting function design, bias terms in GloVe, GloVe vs Word2Vec comparison, training GloVe efficiently.
FastText
Covers character n-gram representation, word vector as n-gram sum, FastText architecture, handling OOV words, morphological awareness, FastText for morphologically rich languages, training FastText models.
Embedding Evaluation
Covers intrinsic vs extrinsic evaluation, word similarity datasets (SimLex, WordSim), analogy accuracy, embedding visualization (t-SNE, UMAP), downstream task evaluation, embedding bias detection, evaluation pitfalls.
Part V: Subword Tokenization
8 chapters
The Vocabulary Problem
Covers OOV word problem, vocabulary size explosion, rare word representation, morphological productivity, compound words, code and technical text, the case for subword units.
Byte Pair Encoding
Covers BPE algorithm step-by-step, merge rules learning, vocabulary size control, BPE encoding procedure, BPE decoding procedure, BPE implementation from scratch, BPE hyperparameters.
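A from-scratch sketch of BPE merge learning on the classic toy vocabulary; illustrative, not production code:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):        # apply the merge everywhere it occurs
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
```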
WordPiece
Covers WordPiece vs BPE differences, likelihood objective for merges, greedy tokenization algorithm, ## prefix notation, WordPiece in BERT, training WordPiece tokenizers, handling unknown characters.
Unigram Language Model Tokenization
Covers unigram LM formulation, EM algorithm for training, Viterbi decoding for tokenization, sampling multiple segmentations, subword regularization, unigram vs BPE comparison, SentencePiece unigram mode.
SentencePiece
Covers treating text as raw bytes, whitespace handling (▁ prefix), BPE and unigram modes, training from raw text, pretokenization elimination, SentencePiece in production, multilingual tokenization.
Tokenizer Training
Covers corpus preparation, vocabulary size selection, special tokens configuration, training with HuggingFace tokenizers, saving and loading tokenizers, tokenizer versioning, domain-specific tokenizers.
Special Tokens
Covers [CLS], [SEP], [PAD], [MASK], [UNK] tokens, beginning/end of sequence tokens, custom special tokens, special token embeddings, token type IDs, handling special tokens in generation.
Tokenization Challenges
Covers number tokenization issues, code tokenization, multilingual text mixing, emoji and Unicode edge cases, tokenization artifacts, adversarial tokenization, measuring tokenization quality.
Part VI: Sequence Labeling
8 chapters
Part-of-Speech Tagging
Covers POS tag sets (Penn Treebank, Universal), POS tagging as classification, contextual disambiguation, POS tagging accuracy metrics, POS tagging for downstream tasks, rule-based vs statistical taggers.
Named Entity Recognition
Covers entity types (PER, ORG, LOC, etc.), NER as sequence labeling, nested entity challenges, entity boundary detection, NER evaluation (exact vs partial match), NER datasets and benchmarks.
BIO Tagging
Covers BIO scheme explanation, BIOES/BILOU variants, converting spans to BIO tags, BIO decoding to spans, handling tagging inconsistencies, BIO for multi-label scenarios, implementing BIO utilities.
Chunking
Covers noun phrase chunking, chunk types (NP, VP, PP), IOB tagging for chunks, chunking vs full parsing, chunking evaluation, chunking as preprocessing, regex chunking with NLTK.
Hidden Markov Models
Covers HMM components (states, observations, transitions), emission and transition probabilities, HMM assumptions (Markov, independence), HMM for POS tagging, HMM parameter estimation, HMM limitations for NLP.
Viterbi Algorithm
Covers optimal path problem formulation, Viterbi recursion derivation, backpointer tracking, Viterbi complexity analysis, log-space computation, implementing Viterbi efficiently, Viterbi for beam search foundation.
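A compact log-space Viterbi with backpointers; the tiny two-state HMM below is invented purely for illustration:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path, computed in log space to avoid underflow."""
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            score, prev = max(
                (V[t - 1][p][0] + math.log(trans_p[p][s] * emit_p[s][obs[t]]), p)
                for p in states
            )
            V[t][s] = (score, prev)                  # keep a backpointer
    last = max(states, key=lambda s: V[-1][s][0])    # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):             # follow backpointers
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ("N", "V")
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.7, "sleep": 0.3}, "V": {"fish": 0.4, "sleep": 0.6}}
print(viterbi(("fish", "sleep"), states, start, trans, emit))  # ['N', 'V']
```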
Conditional Random Fields
Covers CRF vs HMM comparison, CRF feature functions, log-linear formulation, partition function computation, CRF for NER, CRF inference complexity, neural CRF layers.
CRF Training
Covers CRF log-likelihood objective, forward-backward algorithm, gradient computation, L-BFGS optimization, feature template design, CRF regularization, CRF training convergence.
Part VII: Neural Network Foundations
13 chapters
Linear Classifiers
Covers linear decision boundaries, weight vectors and bias, dot product interpretation, multiclass classification (softmax), linear classifier limitations, training with gradient descent.
Activation Functions
Covers sigmoid function and saturation, tanh properties, ReLU and dying ReLU, Leaky ReLU and PReLU, ELU and SELU, GELU derivation and properties, Swish and Mish, choosing activation functions.
Multilayer Perceptrons
Covers hidden layers and depth, weight matrices between layers, forward pass computation, representational capacity, MLP for classification, MLP for regression, MLP architecture design.
Loss Functions
Covers cross-entropy loss derivation, MSE for regression, binary vs multiclass cross-entropy, label smoothing, focal loss for imbalance, loss function numerical stability, custom loss functions.
Backpropagation
Covers computational graphs, chain rule review, forward and backward pass, gradient accumulation, backprop complexity analysis, automatic differentiation, implementing backprop from scratch.
Stochastic Gradient Descent
Covers batch vs stochastic gradient descent, minibatch gradient descent, learning rate selection, SGD convergence properties, SGD noise as regularization, learning rate schedules basics, SGD implementation.
Momentum
Covers momentum intuition (ball rolling), momentum update equations, momentum coefficient selection, dampening oscillations, momentum vs vanilla SGD, Nesterov momentum derivation, implementing momentum.
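A sketch of the update equations (the PyTorch-style formulation), demonstrated on a toy quadratic:

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.1, mu=0.9, nesterov=False):
    """One heavy-ball update: v <- mu*v + g; w <- w - lr*v."""
    v = mu * v + grad
    step = grad + mu * v if nesterov else v   # Nesterov looks ahead along the velocity
    return w - lr * step, v

# Minimize 0.5 * w^2, whose gradient is simply w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(50):
    w, v = momentum_step(w, w, v)
print(w)  # close to the optimum at 0
```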
Adam Optimizer
Covers exponential moving averages, first moment (mean) estimation, second moment (variance) estimation, bias correction derivation, Adam update rule, Adam hyperparameters, Adam convergence properties.
AdamW
Covers L2 regularization vs weight decay, why they differ with Adam, AdamW formulation, weight decay coefficient selection, AdamW as default optimizer, AdamW vs Adam empirically.
Weight Initialization
Covers random initialization importance, Xavier/Glorot initialization derivation, He initialization for ReLU, initialization for different activations, layer-wise initialization, initialization debugging, modern initialization practices.
Batch Normalization
Covers internal covariate shift, batch statistics computation, learnable scale and shift, training vs inference mode, batch norm gradient flow, batch norm placement debates, batch norm limitations.
Dropout
Covers dropout as ensemble, dropout mask sampling, inverted dropout scaling, dropout rate selection, dropout at inference, spatial dropout for sequences, dropout in modern architectures.
Gradient Clipping
Covers gradient explosion detection, clip by value, clip by global norm, gradient clipping implementation, when to use gradient clipping, clipping threshold selection, monitoring gradient norms.
Part VIII: Recurrent Neural Networks
9 chapters
RNN Architecture
Covers recurrent connection intuition, hidden state as memory, unrolled computation graph, parameter sharing across time, RNN for sequence classification, RNN for sequence generation, RNN equations and dimensions.
Backpropagation Through Time
Covers BPTT derivation, gradient flow through time, truncated BPTT, BPTT memory requirements, BPTT implementation, gradient accumulation across timesteps.
Vanishing Gradients
Covers gradient product across timesteps, vanishing gradient analysis, long-range dependency failure, gradient visualization, vanishing vs exploding trade-off, architectural solutions overview.
LSTM Architecture
Covers cell state as information highway, gate mechanism intuition, LSTM diagram walkthrough, information flow in LSTMs, LSTM for long sequences, LSTM memory capacity.
LSTM Gate Equations
Covers forget gate equations, input gate equations, cell state update, output gate equations, hidden state computation, LSTM parameter count, implementing LSTM from scratch.
LSTM Gradient Flow
Covers constant error carousel, forget gate gradient highway, gradient flow analysis, LSTM vs vanilla RNN gradients, peephole connections, LSTM gradient clipping needs.
GRU Architecture
Covers GRU vs LSTM comparison, reset gate function, update gate function, candidate hidden state, GRU equations, GRU parameter efficiency, when to choose GRU vs LSTM.
Bidirectional RNNs
Covers forward and backward passes, hidden state concatenation, bidirectional architectures, bidirectionality for classification, limitations for generation, implementing bidirectional RNNs.
Stacked RNNs
Covers multiple RNN layers, residual connections for depth, layer normalization in RNNs, depth vs width trade-offs, gradient flow in deep RNNs, practical depth limits.
Part IX: Sequence-to-Sequence
7 chapters
Encoder-Decoder Framework
Covers encoder role and design, decoder role and design, context vector as bottleneck, seq2seq for machine translation, seq2seq for summarization, seq2seq training setup.
Teacher Forcing
Covers teacher forcing procedure, exposure bias problem, teacher forcing efficiency, scheduled sampling, curriculum learning, teacher forcing vs autoregressive training.
Beam Search
Covers greedy decoding limitations, beam search algorithm, beam width selection, length normalization, diverse beam search, beam search implementation, beam search vs sampling.
Attention Intuition
Covers attention as soft lookup, attention weight interpretation, attention for variable-length inputs, attention visualization, attention vs pooling, attention computation overview.
Bahdanau Attention
Covers alignment model formulation, score function (additive), attention weight computation, context vector as weighted sum, attention in decoder, Bahdanau attention implementation.
Luong Attention
Covers dot product attention, general (bilinear) attention, concat attention variant, global vs local attention, Luong vs Bahdanau comparison, attention placement (input vs output).
Copy Mechanism
Covers pointer network motivation, copy probability computation, mixing generation and copying, pointer-generator networks, copy mechanism for summarization, OOV handling with copy.
Part X: Self-Attention
6 chapters
Self-Attention Concept
Covers cross-attention vs self-attention, self-attention motivation, all-pairs interaction, self-attention for representation learning, self-attention computational pattern.
Query, Key, Value
Covers QKV intuition (database lookup), projection matrices Wq, Wk, Wv, query-key matching, value retrieval, QKV dimensions and shapes, QKV as learned transformations.
Scaled Dot-Product Attention
Covers dot product for similarity, softmax for normalization, scaling factor derivation (1/√dk), attention output computation, attention in matrix form, attention implementation.
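The whole mechanism fits in a few lines of PyTorch; this sketch follows the standard formulation, with random tensors standing in for real activations:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, in matrix form."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5      # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(2, 5, 64)                    # batch of 2, 5 tokens, d_k = 64
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 5, 64])
```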
Attention Masking
Covers padding masks, causal (look-ahead) masks, combining multiple masks, mask shapes and broadcasting, efficient masking implementation, custom attention patterns.
Multi-Head Attention
Covers multiple attention heads motivation, head dimension splitting, parallel attention computation, output concatenation and projection, head specialization, multi-head vs single head.
Attention Complexity
Covers O(n²) attention complexity, memory requirements, attention bottleneck in long sequences, FLOPs computation, attention vs RNN complexity, practical scaling limits.
Part XI: Positional Encoding
7 chapters
Position Problem
Covers transformer position blindness, why position matters for language, position information requirements, position encoding vs position embedding, absolute vs relative position.
Sinusoidal Position Encoding
Covers sinusoidal formula derivation, wavelength intuition, position encoding visualization, extrapolation properties, sinusoidal encoding implementation, learned vs sinusoidal trade-offs.
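A minimal NumPy implementation of the sinusoidal formula (dimensions chosen only for the example):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_encoding(128, 512).shape)  # (128, 512)
```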
Learned Position Embeddings
Covers position embedding table, position embedding training, maximum sequence length, learned embedding extrapolation, position embedding analysis, GPT-style position embeddings.
Relative Position Encoding
Covers relative position motivation, relative attention formulation, clipping relative positions, relative position in self-attention, Shaw et al. relative positions, relative bias implementation.
Rotary Position Embedding (RoPE)
Covers RoPE intuition, rotation matrix formulation, RoPE in complex numbers, relative position through rotation, RoPE implementation, RoPE frequency patterns.
ALiBi
Covers ALiBi motivation, linear bias by distance, head-specific slopes, ALiBi extrapolation properties, ALiBi simplicity advantages, ALiBi vs RoPE comparison.
Position Encoding Comparison
Covers extrapolation benchmarks, training efficiency comparison, implementation complexity, position encoding for long context, hybrid approaches, current best practices.
Part XII: Transformer Blocks
8 chapters
Residual Connections
Covers residual connection formulation, gradient highway interpretation, residual scaling, residual connections in transformers, pre-norm vs post-norm residuals.
Layer Normalization
Covers layer norm vs batch norm, layer norm formula, learnable affine parameters, layer norm placement, layer norm gradient flow, layer norm implementation.
RMSNorm
Covers RMSNorm derivation, removing mean centering, RMSNorm efficiency, RMSNorm vs LayerNorm performance, RMSNorm in modern architectures.
Pre-Norm vs Post-Norm
Covers original transformer (post-norm), pre-norm formulation, training stability comparison, gradient flow differences, when to use each, modern consensus.
Feed-Forward Networks
Covers FFN architecture, hidden dimension expansion, FFN as two linear layers, position independence, FFN parameter count, FFN computational cost.
FFN Activation Functions
Covers ReLU in original transformer, GELU adoption, GELU approximations, SiLU/Swish in modern models, activation function comparison.
Gated Linear Units
Covers GLU formulation, gating mechanism, SwiGLU derivation, GeGLU variant, GLU parameter efficiency, GLU in modern architectures.
Transformer Block Assembly
Covers standard block structure, component ordering, block implementation, block initialization, forward pass walkthrough, block hyperparameters.
Part XIII: Transformer Architectures
6 chapters
Encoder Architecture
Covers encoder-only design, bidirectional self-attention, encoder for understanding tasks, encoder output usage, BERT-style encoder, encoder layer stacking.
Decoder Architecture
Covers decoder-only design, causal masking requirement, autoregressive generation, decoder for generation tasks, GPT-style decoder, decoder layer stacking.
Encoder-Decoder Architecture
Covers encoder-decoder interaction, cross-attention mechanism, encoder-decoder for seq2seq, T5-style architecture, information flow, when to use encoder-decoder.
Cross-Attention
Covers cross-attention formulation, KV from encoder, Q from decoder, cross-attention masking, cross-attention placement, cross-attention implementation.
Weight Tying
Covers input-output embedding tying, encoder-decoder tying, parameter reduction, weight tying effects on training, when to tie weights.
Architecture Hyperparameters
Covers depth vs width trade-offs, number of heads selection, hidden dimension ratios, FFN expansion ratio, total parameter calculation, architecture search.
Part XIV: Efficient Attention
9 chapters
Quadratic Attention Bottleneck
Covers O(n²) memory analysis, O(n²) compute analysis, attention matrix size, practical sequence limits, bottleneck visualization, motivation for efficiency.
Sparse Attention Patterns
Covers local attention windows, strided attention patterns, block-sparse attention, combining sparse patterns, sparse attention implementation.
Sliding Window Attention
Covers sliding window formulation, window size selection, dilated sliding windows, sliding window for long sequences, Mistral-style windowed attention.
Global Tokens
Covers CLS token global attention, learned global tokens, global-local attention mixing, global token count, implementation strategies.
Longformer
Covers Longformer attention pattern, global attention configuration, Longformer complexity, Longformer for documents, Longformer implementation.
BigBird
Covers BigBird attention pattern, random attention benefits, BigBird theoretical guarantees, BigBird vs Longformer, BigBird applications.
Linear Attention
Covers softmax attention reformulation, kernel feature maps, linear complexity attention, linear attention limitations, Performer and variants.
FlashAttention Algorithm
Covers GPU memory hierarchy, tiling for SRAM, online softmax computation, recomputation strategy, FlashAttention complexity, FlashAttention benefits.
FlashAttention Implementation
Covers CUDA kernel basics, memory access patterns, FlashAttention-2 improvements, using FlashAttention in PyTorch, FlashAttention limitations.
Part XV: Long Context
7 chapters
Context Length Challenges
Covers training sequence length limits, attention memory scaling, position encoding extrapolation, long-range dependency learning, evaluation challenges.
Position Interpolation
Covers linear position scaling, interpolation vs extrapolation, position interpolation implementation, fine-tuning for longer context, interpolation limitations.
NTK-aware Scaling
Covers RoPE frequency analysis, high-frequency preservation, NTK-aware formula, dynamic NTK scaling, NTK vs linear interpolation.
YaRN
Covers YaRN motivation, attention scaling factor, YaRN formula, YaRN training requirements, YaRN vs alternatives.
Attention Sinks
Covers attention sink phenomenon, StreamingLLM approach, sink token design, streaming inference, infinite context generation.
Memory Augmentation
Covers memory network concepts, memory retrieval mechanisms, memory writing and updating, memory-augmented transformers, Memorizing Transformers.
Recurrent Memory
Covers Transformer-XL approach, segment-level processing, recurrent state passing, relative position in recurrence, recurrent memory limitations.
Part XVI: Pre-training Objectives
7 chapters
Causal Language Modeling
Covers CLM objective formulation, autoregressive factorization, CLM loss computation, CLM for generation, CLM training data, CLM scaling properties.
Masked Language Modeling
Covers MLM objective formulation, masking strategies (15% rule), [MASK] token usage, MLM for understanding, MLM training dynamics.
Whole Word Masking
Covers subword masking problems, whole word masking procedure, WWM implementation, WWM vs random masking, WWM for different tokenizers.
Span Corruption
Covers span selection strategies, span length distribution, sentinel tokens, T5-style corruption, span corruption benefits.
Prefix Language Modeling
Covers prefix LM formulation, prefix LM attention pattern, prefix LM for generation, prefix LM training, UniLM-style objectives.
Replaced Token Detection
Covers generator-discriminator setup, replaced vs original detection, RTD efficiency advantages, ELECTRA training procedure, RTD vs MLM comparison.
Denoising Objectives
Covers token deletion, token shuffling, sentence permutation, document rotation, BART-style denoising, combining denoising tasks.
Part XVII: BERT and Variants
8 chapters
BERT Architecture
Covers BERT model sizes, BERT layer configuration, BERT embedding layers, BERT attention patterns, BERT output representations.
BERT Pre-training
Covers pre-training data preparation, MLM implementation, NSP task design, pre-training hyperparameters, pre-training duration.
BERT Fine-tuning
Covers classification fine-tuning, sequence labeling fine-tuning, question answering fine-tuning, fine-tuning hyperparameters, catastrophic forgetting.
BERT Representations
Covers [CLS] token usage, layer selection strategies, pooling strategies, BERT as feature extractor, frozen vs fine-tuned representations.
RoBERTa
Covers dynamic masking, NSP removal, larger batches, more data, RoBERTa training recipe, RoBERTa vs BERT performance.
ALBERT
Covers factorized embeddings, cross-layer parameter sharing, sentence order prediction, ALBERT efficiency, ALBERT performance trade-offs.
ELECTRA
Covers generator training, discriminator training, RTD objective, ELECTRA sample efficiency, ELECTRA scaling, ELECTRA fine-tuning.
DeBERTa
Covers disentangled attention formulation, enhanced mask decoder, DeBERTa position encoding, DeBERTa improvements, DeBERTa-v3 advances.
Part XVIII: GPT Architecture
10 chapters
GPT-1
Covers GPT-1 architecture, GPT-1 pre-training, GPT-1 fine-tuning approach, GPT-1 transfer learning, GPT-1 historical significance.
GPT-2
Covers GPT-2 model sizes, GPT-2 architectural changes, zero-shot task performance, GPT-2 training data (WebText), GPT-2 generation quality.
GPT-3
Covers GPT-3 scale (175B), few-shot prompting discovery, in-context learning analysis, GPT-3 capabilities, GPT-3 limitations.
In-Context Learning
Covers ICL phenomenon, ICL vs fine-tuning, example selection strategies, ICL scaling behavior, ICL theoretical understanding.
Autoregressive Generation
Covers generation procedure, KV caching for efficiency, generation stopping criteria, generation speed optimization, generation code implementation.
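A minimal greedy decoding loop with a KV cache, sketched with HuggingFace Transformers; "gpt2" is just a stand-in for any causal LM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The transformer architecture", return_tensors="pt").input_ids
past = None
for _ in range(20):
    with torch.no_grad():
        # Once the KV cache holds earlier positions, pass only the newest token.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```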
Decoding Temperature
Covers temperature scaling, temperature effects on distribution, temperature selection guidelines, temperature vs quality trade-off.
Top-k Sampling
Covers top-k truncation, k selection strategies, top-k limitations, top-k implementation, combining with temperature.
Nucleus Sampling
Covers top-p formulation, cumulative probability threshold, nucleus sampling benefits, p selection guidelines, nucleus vs top-k.
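A sketch of top-p filtering over logits, following the usual sort-and-cumsum approach; the logits below are toy values:

```python
import torch

def top_p_filter(logits, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability covers p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    remove = cum - probs > p            # mass before this token already exceeds p
    sorted_logits[remove] = float("-inf")
    filtered = torch.full_like(logits, float("-inf"))
    return filtered.scatter(-1, sorted_idx, sorted_logits)

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
probs = torch.softmax(top_p_filter(logits, p=0.8), dim=-1)
print(torch.multinomial(probs, 1))      # sample from the nucleus
```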
Repetition Penalties
Covers repetition in generation, repetition penalty formulation, frequency penalty, presence penalty, n-gram blocking.
Constrained Decoding
Covers grammar-guided generation, JSON schema constraints, regex constraints, constrained beam search, constrained sampling.
Part XIX: Modern Decoder Models
7 chapters
LLaMA Architecture
Covers LLaMA design philosophy, LLaMA architectural choices, LLaMA training data, LLaMA efficiency, LLaMA significance.
LLaMA Components
Covers pre-norm with RMSNorm, SwiGLU FFN, RoPE implementation, component interactions, implementation details.
Grouped Query Attention
Covers GQA motivation, GQA formulation, KV head grouping, GQA memory savings, GQA vs MHA performance, GQA implementation.
Multi-Query Attention
Covers MQA extreme sharing, MQA memory benefits, MQA quality trade-offs, MQA for inference, MQA vs GQA.
Mistral Architecture
Covers Mistral design choices, sliding window attention, Mistral efficiency, Mistral performance, Mistral vs LLaMA.
Qwen Architecture
Covers Qwen architectural choices, Qwen training approach, Qwen multilingual capabilities, Qwen variants.
Phi Models
Covers Phi design philosophy, textbook-quality data, Phi training approach, Phi efficiency, small model capabilities.
Part XX: Encoder-Decoder Models
6 chapters
T5 Architecture
Covers T5 encoder-decoder design, T5 attention patterns, T5 model sizes, T5 relative positions, T5 implementation.
T5 Pre-training
Covers span corruption procedure, sentinel tokens, corruption rate, T5 pre-training data, T5 training scale.
T5 Task Formatting
Covers task prefixes, classification as generation, NER as generation, QA as generation, task formatting examples.
BART Architecture
Covers BART encoder-decoder, BART attention configuration, BART vs T5 comparison, BART model sizes.
BART Pre-training
Covers token masking, token deletion, text infilling, sentence permutation, document rotation, objective combinations.
mT5
Covers mT5 training data, language sampling, cross-lingual transfer, mT5 vs T5 performance, multilingual tokenization.
Part XXI: Scaling Laws
7 chapters
Power Laws in Deep Learning
Covers power law definition, log-log linear relationships, power law fitting, power law universality, power law intuition.
Kaplan Scaling Laws
Covers loss vs parameters, loss vs data, loss vs compute, Kaplan optimal allocation, Kaplan predictions.
Chinchilla Scaling Laws
Covers Chinchilla experiments, revised scaling coefficients, optimal tokens per parameter, Chinchilla vs Kaplan, Chinchilla implications.
Compute-Optimal Training
Covers compute budget allocation, tokens vs parameters ratio, training efficiency, compute-optimal recipes, practical guidelines.
Data-Constrained Scaling
Covers data repetition effects, optimal repetition strategies, data augmentation scaling, synthetic data scaling.
Inference Scaling
Covers training vs inference compute, inference-optimal models, over-training for efficiency, deployment cost modeling.
Predicting Model Performance
Covers loss extrapolation, capability prediction, scaling law uncertainty, prediction reliability, practical forecasting.
Part XXII: Emergent Capabilities
6 chapters
Emergence in Neural Networks
Covers emergence definition, phase transitions, emergence examples, emergence mechanisms, emergence debate.
In-Context Learning Emergence
Covers ICL emergence curves, ICL vs fine-tuning scaling, ICL mechanism hypotheses, ICL as meta-learning.
Chain-of-Thought Emergence
Covers CoT emergence observations, CoT elicitation, CoT scaling behavior, CoT mechanism theories.
Emergence vs Metrics
Covers discontinuous metrics, accuracy threshold effects, smooth underlying capabilities, re-examining emergence claims.
Inverse Scaling
Covers inverse scaling phenomena, distractor tasks, sycophancy scaling, inverse scaling prize findings.
Grokking
Covers grokking phenomenon, grokking in arithmetic, grokking mechanism theories, grokking phase transitions, practical implications.
Part XXIII: Mixture of Experts
10 chapters
Sparse Models
Covers dense vs sparse trade-offs, conditional computation motivation, sparse model efficiency, sparse model challenges.
Expert Networks
Covers expert architecture, expert as FFN, expert capacity, expert count selection, expert placement in transformer.
Gating Networks
Covers router architecture, routing score computation, router training, router learned behavior.
Top-K Routing
Covers top-1 routing, top-2 routing, k selection trade-offs, routing implementation, combining expert outputs.
Load Balancing
Covers expert utilization imbalance, collapse failure mode, load metrics, balanced routing importance.
Auxiliary Balancing Loss
Covers load balancing loss formulation, loss coefficient tuning, balancing vs task loss, auxiliary loss implementation.
Router Z-Loss
Covers router instability, z-loss formulation, z-loss benefits, z-loss coefficient, combined auxiliary losses.
Expert Parallelism
Covers expert placement strategies, all-to-all communication, communication overhead, expert parallelism implementation.
Switch Transformer
Covers Switch Transformer design, top-1 routing choice, capacity factor, Switch scaling results.
Mixtral
Covers Mixtral architecture, Mixtral expert design, Mixtral performance, Mixtral efficiency, Mixtral vs dense models.
Part XXIV: Fine-tuning Fundamentals
5 chapters
Transfer Learning
Covers transfer learning paradigm, pre-training/fine-tuning split, what transfers, transfer learning efficiency.
Full Fine-tuning
Covers full fine-tuning procedure, fine-tuning hyperparameters, learning rate selection, batch size effects.
Catastrophic Forgetting
Covers forgetting phenomenon, forgetting measurement, forgetting mitigation, pre-trained capability preservation.
Fine-tuning Learning Rates
Covers discriminative fine-tuning, layer-wise learning rates, warmup for fine-tuning, learning rate decay.
Fine-tuning Data Efficiency
Covers few-shot fine-tuning, data augmentation, sample efficiency patterns, small data strategies.
Part XXV: Parameter-Efficient Fine-tuning
12 chapters
PEFT Motivation
Covers parameter storage costs, multi-task deployment, PEFT efficiency, PEFT quality trade-offs.
LoRA Concept
Covers weight update decomposition, low-rank assumption, LoRA efficiency gains, LoRA flexibility.
LoRA Mathematics
Covers LoRA formulation W + BA, rank selection, initialization scheme, LoRA gradient computation.
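As a preview of the LoRA chapters, a minimal PyTorch sketch of the W + BA idea; the rank and alpha values are illustrative defaults, not recommendations:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update, scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8192 vs 262,656
```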
LoRA Implementation
Covers LoRA module design, merging weights, LoRA training loop, LoRA in PyTorch, HuggingFace PEFT usage.
LoRA Hyperparameters
Covers rank selection guidelines, alpha/rank ratio, which layers to adapt, LoRA dropout.
QLoRA
Covers 4-bit quantization for base model, NF4 data type, double quantization, QLoRA memory savings.
AdaLoRA
Covers importance-based pruning, SVD-based adaptation, dynamic rank, AdaLoRA training procedure.
IA3
Covers IA3 formulation, learned rescaling vectors, IA3 parameter efficiency, IA3 vs LoRA.
Prefix Tuning
Covers prefix tuning formulation, prefix length selection, prefix tuning for generation, prefix vs LoRA.
Prompt Tuning
Covers prompt tuning formulation, prompt initialization, prompt tuning scaling, prompt length effects.
Adapter Layers
Covers adapter architecture, adapter placement, adapter dimensionality, adapter fusion.
PEFT Comparison
Covers performance comparison, parameter efficiency comparison, task suitability, practical recommendations.
Part XXVI: Instruction Tuning
6 chapters
Instruction Following
Covers instruction tuning motivation, instruction format design, instruction diversity, instruction quality.
Instruction Data Creation
Covers human annotation, template-based generation, seed task expansion, quality filtering.
Self-Instruct
Covers self-instruct procedure, instruction generation, response generation, filtering strategies.
Instruction Format
Covers prompt templates, system messages, multi-turn format, chat templates, role definitions.
Instruction Tuning Training
Covers instruction tuning data mixing, training hyperparameters, loss masking, multi-task learning.
Instruction Following Evaluation
Covers instruction following benchmarks, human evaluation, automatic evaluation, instruction difficulty.
Part XXVII: Alignment and RLHF
16 chapters
Alignment Problem
Covers alignment definition, helpfulness vs harmlessness, alignment challenges, alignment approaches overview.
Human Preference Data
Covers preference collection UI, comparison design, annotator guidelines, preference data quality.
Bradley-Terry Model
Covers pairwise comparison model, preference probability, Bradley-Terry likelihood, preference strength.
Reward Modeling
Covers reward model architecture, preference loss function, reward model training, reward model evaluation.
Reward Hacking
Covers reward hacking examples, distribution shift, over-optimization, reward hacking mitigation.
Policy Gradient Methods
Covers policy definition, REINFORCE algorithm, policy gradient derivation, variance reduction.
PPO Algorithm
Covers clipped objective, PPO derivation, trust region intuition, PPO implementation.
PPO for Language Models
Covers LLM as policy, action space (tokens), reward assignment, KL penalty importance.
RLHF Pipeline
Covers SFT stage, reward model training, PPO fine-tuning, RLHF hyperparameters, RLHF debugging.
KL Divergence Penalty
Covers KL penalty motivation, KL coefficient selection, adaptive KL, KL effects on training.
DPO Concept
Covers DPO motivation, removing reward model, DPO intuition, DPO benefits.
DPO Derivation
Covers DPO from RLHF objective, optimal policy derivation, DPO loss function, DPO as classification.
DPO Implementation
Covers DPO data format, DPO loss computation, DPO training procedure, DPO hyperparameters.
DPO Variants
Covers IPO formulation, KTO for unpaired feedback, ORPO, cDPO, comparing alignment methods.
RLAIF
Covers AI as annotator, constitutional AI principles, AI preference generation, RLAIF scalability.
Iterative Alignment
Covers iterative DPO, online preference learning, self-improvement loops, alignment stability.
Part XXVIII: Inference Optimization
14 chapters
KV Cache
Covers KV cache motivation, cache structure, cache memory requirements, cache management.
KV Cache Memory
Covers cache size calculation, batch size effects, sequence length effects, memory bottleneck.
Paged Attention
Covers memory fragmentation problem, page-based allocation, vLLM approach, paged attention benefits.
KV Cache Compression
Covers cache eviction strategies, attention sink preservation, H2O algorithm, cache quantization.
Weight Quantization Basics
Covers quantization fundamentals, per-tensor vs per-channel, symmetric vs asymmetric, calibration.
INT8 Quantization
Covers INT8 range mapping, absmax quantization, smooth quantization, INT8 accuracy.
INT4 Quantization
Covers 4-bit challenges, group-wise quantization, 4-bit accuracy trade-offs, 4-bit formats.
GPTQ
Covers GPTQ algorithm, layer-wise quantization, Hessian approximation, GPTQ implementation.
AWQ
Covers salient weight preservation, AWQ algorithm, AWQ vs GPTQ, AWQ benefits.
GGUF Format
Covers GGML/GGUF history, quantization types, GGUF file format, llama.cpp integration.
Speculative Decoding
Covers speculative decoding concept, draft model selection, verification procedure, acceptance rate.
Speculative Decoding Math
Covers acceptance criterion, expected speedup, draft quality effects, optimal draft length.
Continuous Batching
Covers static vs continuous batching, iteration-level scheduling, request completion handling, throughput gains.
Inference Serving
Covers inference server architecture, request routing, load balancing, auto-scaling, latency optimization.
Part XXIX: Retrieval-Augmented Generation
14 chapters
RAG Motivation
Covers knowledge limitations, parametric vs non-parametric, RAG benefits, RAG use cases.
RAG Architecture
Covers retriever component, generator component, retrieval timing, architecture variations.
Dense Retrieval
Covers bi-encoder architecture, embedding similarity, dense vs sparse retrieval, dense retrieval training.
Contrastive Learning for Retrieval
Covers contrastive loss, in-batch negatives, hard negative mining, DPR training procedure.
Document Chunking
Covers chunking strategies, chunk size selection, overlap handling, semantic chunking.
Embedding Models
Covers embedding model architectures, pooling strategies, embedding dimensions, embedding model selection.
Vector Similarity Search
Covers distance metrics, exact vs approximate search, complexity trade-offs, similarity search libraries.
HNSW Index
Covers HNSW algorithm, graph construction, search procedure, HNSW parameters.
IVF Index
Covers clustering approach, probe count, IVF-PQ combination, IVF vs HNSW.
Product Quantization
Covers PQ algorithm, codebook learning, PQ accuracy trade-offs, PQ for scale.
Hybrid Search
Covers BM25 + dense fusion, reciprocal rank fusion, weighted combination, hybrid benefits.
Reranking
Covers cross-encoder architecture, reranking procedure, reranker training, reranker latency.
RAG Prompt Engineering
Covers context placement, citation formats, context truncation, instruction design.
RAG Evaluation
Covers retrieval metrics, generation metrics, end-to-end evaluation, RAGAS framework.
Part XXX: Tool Use and Agents
7 chapters
Tool Use Motivation
Covers LLM limitations, tool augmentation, tool use examples, tool use benefits.
Function Calling
Covers function schema definition, function call generation, function output handling, function calling fine-tuning.
ReAct Pattern
Covers ReAct formulation, thought-action-observation loop, ReAct prompting, ReAct examples.
Tool Selection
Covers tool descriptions, tool routing, multi-tool scenarios, tool selection training.
Agent Architectures
Covers agent loop design, state management, planning strategies, agent termination.
Agent Memory
Covers short-term memory, long-term memory, memory retrieval, memory summarization.
Agent Evaluation
Covers task completion metrics, trajectory evaluation, agent benchmarks, safety evaluation.
Part XXXI: Multimodal Models
8 chapters
Vision Transformer
Covers image patching, patch embeddings, ViT architecture, ViT pre-training.
CLIP
Covers CLIP architecture, CLIP training objective, CLIP zero-shot classification, CLIP embeddings.
Vision Encoders for VLMs
Covers ViT variants for VLMs, SigLIP improvements, image resolution handling, encoder selection.
Vision-Language Projection
Covers linear projection, MLP projection, Q-Former approach, projection training.
LLaVA Architecture
Covers LLaVA design, two-stage training, visual conversation, LLaVA variants.
Flamingo Architecture
Covers cross-attention to images, gated cross-attention, few-shot visual learning, Flamingo training.
Multimodal Training Data
Covers image-text pairs, interleaved documents, visual instruction data, data quality.
Multimodal Evaluation
Covers VQA benchmarks, multimodal understanding benchmarks, multimodal generation evaluation.
Part XXXII: Speech and Audio
5 chapters
Speech Representations
Covers mel spectrograms, mel filterbanks, feature normalization, audio preprocessing.
Whisper Architecture
Covers Whisper encoder-decoder, multitask training, language tokens, timestamp prediction.
Whisper Training
Covers Whisper training data, weak supervision, multilingual training, Whisper capabilities.
Speech-Language Integration
Covers speech encoder + LLM, audio tokens, speech-to-text-to-LLM vs end-to-end, speech LLM architectures.
Text-to-Speech
Covers TTS architecture overview, vocoder role, TTS quality metrics, neural TTS approaches.
Part XXXIII: Evaluation Fundamentals
7 chapters
Perplexity Evaluation
Covers perplexity calculation, perplexity interpretation, perplexity limitations, comparing perplexities.
Cross-Entropy Loss
Covers cross-entropy definition, bits-per-character, cross-entropy vs perplexity, loss curves.
BLEU Score
Covers n-gram precision, brevity penalty, BLEU formula, BLEU limitations, corpus vs sentence BLEU.
ROUGE Scores
Covers ROUGE-N, ROUGE-L, ROUGE-W, ROUGE interpretation, ROUGE limitations.
BERTScore
Covers BERTScore computation, token alignment, BERTScore variants, BERTScore vs BLEU.
Exact Match and F1
Covers exact match scoring, token-level F1, normalization for matching, metric selection.
Calibration
Covers calibration definition, expected calibration error, calibration plots, calibration methods.
Part XXXIV: Benchmark Evaluation
8 chapters
MMLU
Covers MMLU structure, subject coverage, MMLU evaluation protocol, MMLU limitations.
HellaSwag
Covers HellaSwag task design, adversarial filtering, HellaSwag evaluation, HellaSwag saturation.
GSM8K
Covers GSM8K problem types, chain-of-thought evaluation, GSM8K accuracy metrics, math reasoning assessment.
HumanEval
Covers HumanEval structure, functional correctness, pass@k metric, HumanEval limitations.
MBPP
Covers MBPP dataset, MBPP vs HumanEval, code evaluation challenges.
TruthfulQA
Covers TruthfulQA design, truthfulness vs informativeness, TruthfulQA evaluation methods.
Benchmark Contamination
Covers contamination problem, contamination detection methods, n-gram overlap analysis, contamination mitigation.
Benchmark Saturation
Covers ceiling effects, benchmark retirement, dynamic benchmarks, benchmark evolution.
Part XXXV: Human and Model Evaluation
6 chapters
Human Evaluation Design
Covers evaluation interface design, task instructions, annotator selection, evaluation cost.
Inter-Annotator Agreement
Covers Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, handling disagreement.
Preference Evaluation
Covers A/B comparison design, Elo rating systems, preference aggregation, statistical significance.
LLM-as-Judge
Covers judge prompt design, judge model selection, judge calibration, judge limitations.
Position Bias in LLM Judges
Covers position bias measurement, bias mitigation (swapping), verbosity bias, sycophancy.
Evaluation Prompt Engineering
Covers prompt sensitivity, evaluation prompt design, few-shot vs zero-shot evaluation, evaluation consistency.
Part XXXVI: Bias and Fairness
5 chapters
Bias in Language Models
Covers bias sources, bias types (demographic, cultural), bias in training data, bias amplification.
Bias Measurement
Covers embedding association tests, generation bias metrics, classification bias metrics, bias benchmarks.
Bias Mitigation
Covers data balancing, fine-tuning for fairness, prompt-based mitigation, debiasing embeddings.
Fairness Metrics
Covers demographic parity, equalized odds, fairness trade-offs, choosing fairness metrics.
Representation Harms
Covers stereotyping, erasure, demeaning associations, measuring representation harms.
Part XXXVII: Hallucination and Factuality
6 chapters
Hallucination Types
Covers intrinsic vs extrinsic hallucination, factual errors, fabrication, inconsistency.
Hallucination Detection
Covers entailment-based detection, knowledge base verification, self-consistency checks, detection models.
Hallucination Causes
Covers training data issues, exposure bias, knowledge gaps, generation pressure.
Hallucination Mitigation
Covers retrieval augmentation, decoding strategies, training approaches, uncertainty expression.
Attribution and Citation
Covers inline citation, attribution accuracy, source verification, attribution evaluation.
Uncertainty Quantification
Covers confidence calibration, verbalized uncertainty, sampling-based uncertainty, uncertainty communication.
Part XXXVIII: Safety and Security
8 chapters
Safety Risks
Covers harmful content generation, misuse scenarios, unintended harms, safety threat models.
Red Teaming
Covers red team methodology, attack taxonomies, red team findings, red team automation.
Jailbreaking
Covers jailbreak techniques, prompt injection, adversarial suffixes, jailbreak defenses.
Prompt Injection
Covers direct prompt injection, indirect prompt injection, injection in RAG, injection defenses.
Content Filtering
Covers classification-based filtering, rule-based filtering, filter placement, filter evaluation.
Guardrails
Covers input guardrails, output guardrails, guardrail frameworks, guardrail design.
Memorization and Privacy
Covers memorization measurement, extractable memorization, PII in training data, privacy risks.
Differential Privacy
Covers DP-SGD basics, privacy budget, DP accuracy trade-offs, DP for LLMs.
Part XXXIX: Interpretability
11 chapters
Interpretability Goals
Covers debugging, trust, safety, scientific understanding, interpretability approaches overview.
Attention Visualization
Covers attention weight extraction, attention head visualization, attention interpretation caveats, attention tools.
Attention Analysis Limitations
Covers attention vs importance, attention manipulation studies, gradient-based alternatives.
Probing Classifiers
Covers linear probing methodology, probing task design, probing interpretation, control tasks.
Probing Layers
Covers layer selection, representation evolution, task localization, layer probing patterns.
Activation Patching
Covers patching methodology, locating information, patching experiments, causal tracing.
Logit Lens
Covers logit lens concept, intermediate vocabulary projection, tuned lens, lens interpretation.
Sparse Autoencoders
Covers SAE architecture, sparsity constraints, dictionary learning, SAE for LLMs.
Feature Interpretation
Covers feature activation patterns, feature naming, automated interpretation, feature circuits.
Mechanistic Interpretability
Covers circuit analysis, algorithmic tasks, induction heads, mechanistic discoveries.
Activation Steering
Covers steering vectors, activation addition, representation engineering, steering applications.
Part XL: Data Curation
10 chapters
Web Crawling
Covers Common Crawl, crawling strategies, robots.txt respect, crawl freshness.
Document Extraction
Covers HTML parsing, boilerplate removal, content extraction, trafilatura and similar tools.
Language Identification
Covers language ID models, multilingual document handling, code-switching, language filtering.
Deduplication
Covers exact deduplication, near-duplicate detection, document vs substring dedup, dedup at scale.
MinHash
Covers MinHash algorithm, Jaccard similarity estimation, MinHash LSH, MinHash implementation.
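A toy MinHash sketch (salted MD5 stands in for a proper family of hash functions) showing how signature agreement estimates Jaccard similarity:

```python
import hashlib

def minhash_signature(shingles, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set's shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"the cat", "cat sat", "sat on"}
b = {"the cat", "cat sat", "sat by"}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))  # ≈ 0.5, the true Jaccard
```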
Quality Filtering
Covers heuristic filters, perplexity filtering, classifier-based filtering, filter thresholds.
Toxicity Filtering
Covers toxicity classifiers, toxicity thresholds, over-filtering risks, toxicity filter evaluation.
PII Removal
Covers PII detection methods, PII removal strategies, PII removal evaluation, privacy preservation.
Data Mixing
Covers domain proportions, quality weighting, data mixing experiments, optimal mixing.
Synthetic Data
Covers synthetic data generation, quality verification, synthetic data diversity, distillation.
Part XLI: Training Infrastructure
11 chapters
GPU Architecture
Covers GPU memory hierarchy, CUDA cores, tensor cores, GPU specifications.
Memory Management
Covers memory breakdown (activations, parameters, gradients, optimizer states), memory estimation, OOM debugging.
Data Parallelism
Covers DDP algorithm, gradient synchronization, all-reduce operations, DDP scaling.
Tensor Parallelism
Covers column parallelism, row parallelism, communication patterns, Megatron-style parallelism.
Pipeline Parallelism
Covers pipeline stages, micro-batching, pipeline bubbles, pipeline schedules (GPipe, 1F1B).
ZeRO Optimization
Covers ZeRO stage 1 (optimizer state partitioning), ZeRO stage 2 (gradient partitioning), ZeRO stage 3 (parameter partitioning), ZeRO memory savings.
FSDP
Covers FSDP concepts, FSDP vs ZeRO, FSDP sharding strategies, FSDP usage.
Activation Checkpointing
Covers checkpointing concept, checkpoint selection, checkpointing overhead, selective checkpointing.
Mixed Precision Training
Covers floating point formats, loss scaling, BF16 advantages, mixed precision implementation.
Communication Optimization
Covers gradient compression, communication overlap, topology-aware communication, NCCL optimization.
Checkpointing and Recovery
Covers checkpoint contents, checkpoint frequency, async checkpointing, fault recovery.
Part XLII: Training Optimization
8 chapters
Learning Rate Warmup
Covers warmup motivation, linear warmup, warmup duration, warmup for large batches.
Learning Rate Decay
Covers step decay, exponential decay, inverse square root decay, decay scheduling.
Cosine Learning Rate Schedule
Covers cosine decay formula, cosine with restarts, cosine schedule parameters, cosine vs linear.
Large Batch Training
Covers batch size effects, learning rate scaling, batch size limits, LAMB optimizer.
Weight Decay
Covers weight decay formula, decoupled weight decay, weight decay selection, weight decay interaction with Adam.
Gradient Accumulation
Covers accumulation procedure, accumulation steps, accumulation for memory, accumulation correctness.
Training Stability
Covers loss spikes, gradient norm monitoring, stability techniques, training stability debugging.
Hyperparameter Selection
Covers hyperparameter search, hyperparameter transfer, critical vs robust hyperparameters, default recipes.
Part XLIII: Code Generation
6 chapters
Code LLM Training
Covers code training data, code tokenization, fill-in-the-middle training, code pre-training objectives.
Code Understanding
Covers code explanation, bug detection, code review, code search.
Code Completion
Covers completion context, completion ranking, completion latency, completion UX.
Code Generation
Covers docstring-to-code, test-to-code, code generation strategies, generation quality.
Code Execution
Covers sandboxed execution, execution feedback, iterative refinement, execution safety.
Code Evaluation
Covers functional correctness, pass@k metric, code benchmarks, beyond correctness.
Part XLIV: Production Systems
9 chapters
Model Serving
Covers serving frameworks, model loading, request handling, serving configuration.
Latency Optimization
Covers latency breakdown, batching latency, streaming responses, latency monitoring.
Throughput Optimization
Covers batch size tuning, GPU utilization, concurrent requests, throughput measurement.
Auto-scaling
Covers scaling metrics, horizontal scaling, scale-up vs scale-out, scaling policies.
Model Routing
Covers model selection, A/B testing, model cascades, routing strategies.
Caching
Covers prompt caching, semantic caching, cache invalidation, cache hit rates.
Monitoring
Covers metrics collection, alerting, logging, dashboards.
Quality Monitoring
Covers output quality metrics, drift detection, regression detection, quality alerts.
Cost Management
Covers cost modeling, cost optimization, cost allocation, cost monitoring.
Part XLV: Continual Learning
5 chapters
Continual Learning Problem
Covers continual learning definition, catastrophic forgetting, continual learning scenarios.
Regularization Methods
Covers elastic weight consolidation, synaptic intelligence, parameter importance, regularization trade-offs.
Replay Methods
Covers replay buffer design, pseudo-rehearsal, generative replay, replay selection.
Architecture Methods
Covers progressive networks, expert expansion, architecture search, modular approaches.
Continual Learning Evaluation
Covers forward transfer, backward transfer, evaluation protocols, continual benchmarks.
Part XLVI: Model Compression
6 chapters
Knowledge Distillation
Covers distillation objective, temperature in distillation, teacher selection, distillation for LLMs.
Distillation Variants
Covers feature distillation, attention transfer, progressive distillation, on-policy distillation.
Pruning Basics
Covers weight pruning, structured vs unstructured, pruning criteria, pruning schedule.
Structured Pruning
Covers head pruning, layer pruning, width pruning, structured pruning implementation.
Model Merging
Covers weight averaging, task arithmetic, TIES merging, DARE merging.
Model Merging Applications
Covers multi-task merging, style merging, capability composition, merging evaluation.
Part XLVII: Advanced Topics
11 chapters
Constitutional AI
Covers constitutional principles, critique and revision, CAI training, CAI effectiveness.
Process Reward Models
Covers outcome vs process reward, PRM training, PRM for math, PRM limitations.
Test-Time Compute
Covers multiple sampling, self-consistency, iterative refinement, compute-optimal inference.
Chain-of-Thought
Covers CoT prompting, zero-shot CoT, CoT fine-tuning, CoT limitations.
Self-Consistency
Covers self-consistency procedure, sampling diversity, voting strategies, self-consistency effectiveness.
Tree of Thought
Covers ToT framework, thought generation, thought evaluation, ToT search.
Retrieval-Augmented Training
Covers RETRO architecture, retrieval during training, retrieved context integration.
Long-Form Generation
Covers outline-based generation, hierarchical generation, coherence maintenance, long-form evaluation.
Watermarking
Covers watermarking schemes, statistical detection, watermark robustness, watermark evaluation.
Model Cards
Covers model card contents, intended use documentation, limitation documentation, model card best practices.
Responsible Deployment
Covers release decisions, staged release, access control, deployment monitoring.
In Progress
This comprehensive handbook is currently in development. Each chapter will be published as it's completed, with practical examples, code implementations, and real-world applications.