
For engineers, researchers, students, AI enthusiasts, linguists, product managers, and anyone interested in understanding or building modern language AI systems, from foundational NLP to advanced large language models.
Language AI Handbook
A Complete Guide to Natural Language Processing and Large Language Models: From Classical NLP and Transformer Architecture to Pre-training, Fine-tuning, and Production Deployment
About This Book
Language AI has transformed from an academic curiosity into the defining technology of our era. But beneath the hype of ChatGPT and Claude lies a rich technical landscape that most practitioners only partially understand. This handbook gives you the complete picture, from classical NLP techniques that still matter to the cutting-edge architectures powering today's most capable systems.
Begin with the fundamentals that never go out of style: tokenization, embeddings, and the statistical foundations that inform modern approaches. Then dive deep into the transformer architecture. Learn not just how to use it, but how it actually works. Understand self-attention mathematically, grasp why positional encodings matter, and see how architectural choices like layer normalization affect training dynamics.
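To make that promise concrete: the self-attention computation at the heart of the transformer chapters is only a few lines of linear algebra. A minimal NumPy sketch with toy shapes and random inputs (no masking, no multiple heads, dimensions chosen purely for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Everything else in the transformer (heads, masks, positional information) is layered on top of this computation; the chapters below build each piece up in turn.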
Table of Contents
Part I: Text as Data
5 chapters
Character Encoding
Covers ASCII origins and 7-bit limitations, Unicode code points and planes, UTF-8 variable-width encoding scheme, byte order marks and endianness, encoding detection heuristics, common encoding errors and mojibake, practical encoding/decoding in Python.
Text Normalization
Covers Unicode normalization forms (NFC, NFD, NFKC, NFKD), case folding vs lowercasing, accent and diacritic handling, whitespace normalization, ligature expansion, full-width to half-width conversion, implementing a normalization pipeline.
Regular Expressions
Covers regex syntax and metacharacters, character classes and quantifiers, grouping and backreferences, lookahead and lookbehind assertions, greedy vs lazy matching, common NLP patterns (emails, URLs, dates), regex performance considerations.
Sentence Segmentation
Covers period disambiguation challenges, abbreviation handling, rule-based boundary detection, Punkt sentence tokenizer algorithm, evaluation metrics for segmentation, handling edge cases (quotes, parentheses, lists), multilingual segmentation issues.
Word Tokenization
Covers whitespace tokenization limitations, punctuation handling rules, contractions and clitics, language-specific challenges (Chinese, Japanese, German compounds), Penn Treebank tokenization standard, building a rule-based tokenizer, tokenization evaluation.
Part II: Classical Text Representations
9 chapters
Bag of Words
Covers document-term matrix construction, vocabulary building from corpus, word counting and frequency vectors, sparse matrix representation (CSR/CSC formats), vocabulary pruning (min_df, max_df), binary vs count representations, limitations of word order loss.
N-grams
Covers bigram and trigram extraction, n-gram vocabulary explosion, n-gram frequency distributions, Zipf's law in n-grams, character n-grams for robustness, skip-grams and flexible windows, n-gram indexing for search.
N-gram Language Models
Covers Markov assumption and chain rule, maximum likelihood estimation, probability calculation for sequences, handling unseen n-grams, start and end tokens, generating text from n-gram models, model storage and lookup efficiency.
Smoothing Techniques
Covers add-one (Laplace) smoothing, add-k smoothing and tuning, Good-Turing smoothing derivation, Kneser-Ney smoothing intuition and formula, interpolation vs backoff, modified Kneser-Ney, comparing smoothing methods empirically.
Perplexity
Covers cross-entropy definition and derivation, perplexity as branching factor, relationship to bits-per-character, held-out evaluation methodology, perplexity vs downstream performance, comparing models with perplexity, perplexity limitations and caveats.
Term Frequency
Covers raw term frequency, log-scaled term frequency, boolean term frequency, augmented term frequency, L2-normalized frequency vectors, term frequency sparsity patterns, efficient term frequency computation.
Inverse Document Frequency
Covers document frequency calculation, IDF formula derivation, IDF intuition (rare words matter more), smoothed IDF variants, IDF across corpus splits, relationship to information theory, implementing IDF efficiently.
TF-IDF
Covers TF-IDF formula and variants, TF-IDF vector computation, TF-IDF normalization options, BM25 as TF-IDF extension, document similarity with TF-IDF, TF-IDF for feature extraction, sklearn TfidfVectorizer deep dive.
BM25
Covers BM25 derivation from probabilistic IR, saturation parameter k1, length normalization parameter b, BM25+ and BM25L variants, field-weighted BM25, implementing BM25 scoring, BM25 vs TF-IDF empirically.
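As a taste of where this part ends up, the BM25 scoring function from the final chapter fits in a few lines. A minimal sketch over pre-tokenized toy documents, with k1 and b set to common default values (the smoothed IDF variant shown is one of several in use):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against a query (a list of tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))      # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)   # smoothed IDF
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"], ["cat", "and", "dog"]]
print(bm25_scores(["cat"], docs))
```

The k1 term caps how much repeated occurrences of a term can contribute (saturation), and b controls how strongly long documents are penalized, both discussed in the chapter.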
Part III: Distributional Semantics
4 chapters
The Distributional Hypothesis
Covers Firth's "you shall know a word by the company it keeps," distributional similarity intuition, context window definitions, paradigmatic vs syntagmatic relations, word similarity from distributions, limitations of distributional semantics.
Co-occurrence Matrices
Covers word-word co-occurrence matrices, word-document matrices, context window size effects, weighting by distance, symmetric vs directional contexts, matrix sparsity patterns, efficient construction algorithms.
Pointwise Mutual Information
Covers PMI formula derivation, PMI interpretation as association, positive PMI (PPMI), shifted PPMI variants, PMI matrix properties, PMI vs raw counts comparison, PMI for collocation extraction.
Singular Value Decomposition
Covers SVD mathematical formulation, truncated SVD for dimensionality reduction, LSA (Latent Semantic Analysis), choosing embedding dimensions, SVD computational complexity, randomized SVD for scale, interpreting SVD dimensions.
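The pipeline this part builds, counts to PPMI to truncated SVD, can be run end to end on a toy co-occurrence matrix. A sketch with a hypothetical 4-word vocabulary and invented counts:

```python
import numpy as np

# Toy word-word co-occurrence counts (hypothetical 4-word vocabulary).
vocab = ["ice", "steam", "solid", "gas"]
C = np.array([[0, 2, 8, 1],
              [2, 0, 1, 7],
              [8, 1, 0, 0],
              [1, 7, 0, 0]], dtype=float)

# PPMI: max(0, log P(w, c) / (P(w) P(c))).
total = C.sum()
Pwc = C / total
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(Pwc / (Pw * Pc))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0), 0.0)

# Truncated SVD: keep the top-2 dimensions as dense LSA-style word vectors.
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :2] * S[:2]
print(embeddings.shape)  # (4, 2)
```

Real corpora need sparse matrices and randomized SVD, as the chapter discusses, but the math is the same.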
Part IV: Word Embeddings
9 chapters
Skip-gram Model
Covers skip-gram architecture diagram, input/output representations, softmax over vocabulary, skip-gram objective function, training data generation, window size hyperparameter, skip-gram vs CBOW intuition.
CBOW Model
Coming Soon. Covers CBOW architecture, context word averaging, CBOW objective function, CBOW vs skip-gram training speed, CBOW for frequent words, implementing CBOW forward pass, CBOW gradient derivation.
Negative Sampling
Coming Soon. Covers softmax computational bottleneck, negative sampling objective derivation, sampling distribution (unigram^0.75), number of negatives hyperparameter, negative sampling gradient computation, NCE vs negative sampling, implementing efficient sampling.
Hierarchical Softmax
Coming Soon. Covers binary tree construction (Huffman coding), path probability computation, hierarchical softmax objective, gradient computation along paths, tree structure impact on learning, hierarchical softmax vs negative sampling, when to use each approach.
Word2Vec Training
Coming Soon. Covers data preprocessing pipeline, subsampling frequent words, learning rate scheduling, minibatch vs online training, convergence monitoring, gensim Word2Vec usage, training from scratch in PyTorch.
Word Analogy
Coming Soon. Covers vector arithmetic for analogies, parallelogram model, analogy evaluation datasets, 3CosAdd vs 3CosMul methods, analogy accuracy metrics, limitations of analogy evaluation, what analogies reveal about embeddings.
GloVe
Coming Soon. Covers GloVe objective function derivation, weighted least squares formulation, relationship to matrix factorization, weighting function design, bias terms in GloVe, GloVe vs Word2Vec comparison, training GloVe efficiently.
FastText
Coming Soon. Covers character n-gram representation, word vector as n-gram sum, FastText architecture, handling OOV words, morphological awareness, FastText for morphologically rich languages, training FastText models.
Embedding Evaluation
Coming Soon. Covers intrinsic vs extrinsic evaluation, word similarity datasets (SimLex, WordSim), analogy accuracy, embedding visualization (t-SNE, UMAP), downstream task evaluation, embedding bias detection, evaluation pitfalls.
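The word-analogy idea running through this part reduces to vector arithmetic plus a cosine-similarity nearest-neighbor search (3CosAdd). A toy sketch with hypothetical 3-dimensional embeddings; real vectors come from training Word2Vec, GloVe, or FastText:

```python
import numpy as np

def most_similar(target, embeddings, exclude):
    """Return the vocabulary word whose vector has highest cosine similarity to target."""
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hypothetical toy embeddings, hand-built so the analogy works.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
}
# 3CosAdd: king - man + woman should land nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(most_similar(target, emb, exclude={"king", "man", "woman"}))  # queen
```

The convention of excluding the three input words from the search is standard in analogy evaluation, a caveat the Word Analogy chapter examines.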
Part V: Subword Tokenization
8 chapters
The Vocabulary Problem
Coming Soon. Covers OOV word problem, vocabulary size explosion, rare word representation, morphological productivity, compound words, code and technical text, the case for subword units.
Byte Pair Encoding
Coming Soon. Covers BPE algorithm step-by-step, merge rules learning, vocabulary size control, BPE encoding procedure, BPE decoding procedure, BPE implementation from scratch, BPE hyperparameters.
WordPiece
Coming Soon. Covers WordPiece vs BPE differences, likelihood objective for merges, greedy tokenization algorithm, ## prefix notation, WordPiece in BERT, training WordPiece tokenizers, handling unknown characters.
Unigram Language Model Tokenization
Coming Soon. Covers unigram LM formulation, EM algorithm for training, Viterbi decoding for tokenization, sampling multiple segmentations, subword regularization, unigram vs BPE comparison, SentencePiece unigram mode.
SentencePiece
Coming Soon. Covers treating text as raw bytes, whitespace handling (▁ prefix), BPE and unigram modes, training from raw text, pretokenization elimination, SentencePiece in production, multilingual tokenization.
Tokenizer Training
Coming Soon. Covers corpus preparation, vocabulary size selection, special tokens configuration, training with HuggingFace tokenizers, saving and loading tokenizers, tokenizer versioning, domain-specific tokenizers.
Special Tokens
Coming Soon. Covers [CLS], [SEP], [PAD], [MASK], [UNK] tokens, beginning/end of sequence tokens, custom special tokens, special token embeddings, token type IDs, handling special tokens in generation.
Tokenization Challenges
Coming Soon. Covers number tokenization issues, code tokenization, multilingual text mixing, emoji and Unicode edge cases, tokenization artifacts, adversarial tokenization, measuring tokenization quality.
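The BPE merge-learning loop at the heart of this part fits in a dozen lines. A from-scratch sketch: words are stored as space-separated symbols with corpus frequencies (the toy corpus is the classic illustrative example, not real data), and the string-replace merge is a simplification that works for this toy:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Apply one learned merge rule: rewrite 'a b' as 'ab' in every word."""
    a, b = pair
    return {w.replace(f"{a} {b}", f"{a}{b}"): f for w, f in words.items()}

# Toy corpus: each word split into characters, with frequencies.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):                       # learn three merge rules
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(best, words)
print(merges)
```

Encoding new text then means replaying the learned merge rules in order, and the vocabulary size is controlled simply by how many merges you learn.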
Part VI: Sequence Labeling
8 chapters
Part-of-Speech Tagging
Coming Soon. Covers POS tag sets (Penn Treebank, Universal), POS tagging as classification, contextual disambiguation, POS tagging accuracy metrics, POS tagging for downstream tasks, rule-based vs statistical taggers.
Named Entity Recognition
Coming Soon. Covers entity types (PER, ORG, LOC, etc.), NER as sequence labeling, nested entity challenges, entity boundary detection, NER evaluation (exact vs partial match), NER datasets and benchmarks.
BIO Tagging
Coming Soon. Covers BIO scheme explanation, BIOES/BILOU variants, converting spans to BIO tags, BIO decoding to spans, handling tagging inconsistencies, BIO for multi-label scenarios, implementing BIO utilities.
Chunking
Coming Soon. Covers noun phrase chunking, chunk types (NP, VP, PP), IOB tagging for chunks, chunking vs full parsing, chunking evaluation, chunking as preprocessing, regex chunking with NLTK.
Hidden Markov Models
Coming Soon. Covers HMM components (states, observations, transitions), emission and transition probabilities, HMM assumptions (Markov, independence), HMM for POS tagging, HMM parameter estimation, HMM limitations for NLP.
Viterbi Algorithm
Coming Soon. Covers optimal path problem formulation, Viterbi recursion derivation, backpointer tracking, Viterbi complexity analysis, log-space computation, implementing Viterbi efficiently, Viterbi for beam search foundation.
Conditional Random Fields
Coming Soon. Covers CRF vs HMM comparison, CRF feature functions, log-linear formulation, partition function computation, CRF for NER, CRF inference complexity, neural CRF layers.
CRF Training
Coming Soon. Covers CRF log-likelihood objective, forward-backward algorithm, gradient computation, L-BFGS optimization, feature template design, CRF regularization, CRF training convergence.
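The Viterbi decoding covered in this part is a short dynamic program. A log-space sketch over a hypothetical two-tag HMM, with all probabilities invented purely for illustration:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely hidden-state sequence via log-space dynamic programming."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            scores[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptrs[s] = prev
        V.append(scores)
        back.append(ptrs)
    state = max(V[-1], key=V[-1].get)     # best final state
    path = [state]
    for ptrs in reversed(back):           # follow backpointers
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Toy two-tag example; all probabilities are hypothetical.
lg = math.log
states = ("N", "V")
log_start = {"N": lg(0.6), "V": lg(0.4)}
log_trans = {"N": {"N": lg(0.3), "V": lg(0.7)}, "V": {"N": lg(0.6), "V": lg(0.4)}}
log_emit = {"N": {"dog": lg(0.9), "barks": lg(0.1)},
            "V": {"dog": lg(0.2), "barks": lg(0.8)}}
print(viterbi(["dog", "barks"], states, log_start, log_trans, log_emit))  # ['N', 'V']
```

Working in log space avoids the numerical underflow of multiplying many small probabilities, a point the Viterbi chapter makes explicit.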
Part VII: Neural Network Foundations
13 chapters
Linear Classifiers
Coming Soon. Covers linear decision boundaries, weight vectors and bias, dot product interpretation, multiclass classification (softmax), linear classifier limitations, training with gradient descent.
Activation Functions
Coming Soon. Covers sigmoid function and saturation, tanh properties, ReLU and dying ReLU, Leaky ReLU and PReLU, ELU and SELU, GELU derivation and properties, Swish and Mish, choosing activation functions.
Multilayer Perceptrons
Coming Soon. Covers hidden layers and depth, weight matrices between layers, forward pass computation, representational capacity, MLP for classification, MLP for regression, MLP architecture design.
Loss Functions
Coming Soon. Covers cross-entropy loss derivation, MSE for regression, binary vs multiclass cross-entropy, label smoothing, focal loss for imbalance, loss function numerical stability, custom loss functions.
Backpropagation
Coming Soon. Covers computational graphs, chain rule review, forward and backward pass, gradient accumulation, backprop complexity analysis, automatic differentiation, implementing backprop from scratch.
Stochastic Gradient Descent
Coming Soon. Covers batch vs stochastic gradient descent, minibatch gradient descent, learning rate selection, SGD convergence properties, SGD noise as regularization, learning rate schedules basics, SGD implementation.
Momentum
Coming Soon. Covers momentum intuition (ball rolling), momentum update equations, momentum coefficient selection, dampening oscillations, momentum vs vanilla SGD, Nesterov momentum derivation, implementing momentum.
Adam Optimizer
Coming Soon. Covers exponential moving averages, first moment (mean) estimation, second moment (variance) estimation, bias correction derivation, Adam update rule, Adam hyperparameters, Adam convergence properties.
AdamW
Coming Soon. Covers L2 regularization vs weight decay, why they differ with Adam, AdamW formulation, weight decay coefficient selection, AdamW as default optimizer, AdamW vs Adam empirically.
Weight Initialization
Coming Soon. Covers random initialization importance, Xavier/Glorot initialization derivation, He initialization for ReLU, initialization for different activations, layer-wise initialization, initialization debugging, modern initialization practices.
Batch Normalization
Coming Soon. Covers internal covariate shift, batch statistics computation, learnable scale and shift, training vs inference mode, batch norm gradient flow, batch norm placement debates, batch norm limitations.
Dropout
Coming Soon. Covers dropout as ensemble, dropout mask sampling, inverted dropout scaling, dropout rate selection, dropout at inference, spatial dropout for sequences, dropout in modern architectures.
Gradient Clipping
Coming Soon. Covers gradient explosion detection, clip by value, clip by global norm, gradient clipping implementation, when to use gradient clipping, clipping threshold selection, monitoring gradient norms.
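The Adam update rule discussed in this part can be written out directly from its two moment estimates. A minimal sketch on a toy 1-D quadratic; the hyperparameters are the usual published defaults except the learning rate, raised here so the demo converges quickly:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient (m) and squared gradient (v), bias-corrected."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)        # bias correction for zero-initialized EMAs
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 starting from w = 3 (toy demo).
w, m, v = np.array(3.0), 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(float(w))
```

Note the bias-correction terms: without them the zero-initialized moving averages underestimate the true moments early in training, the derivation the Adam chapter walks through.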
Part VIII: Recurrent Neural Networks
9 chapters
RNN Architecture
Coming Soon. Covers recurrent connection intuition, hidden state as memory, unrolled computation graph, parameter sharing across time, RNN for sequence classification, RNN for sequence generation, RNN equations and dimensions.
Backpropagation Through Time
Coming Soon. Covers BPTT derivation, gradient flow through time, truncated BPTT, BPTT memory requirements, BPTT implementation, gradient accumulation across timesteps.
Vanishing Gradients
Coming Soon. Covers gradient product across timesteps, vanishing gradient analysis, long-range dependency failure, gradient visualization, vanishing vs exploding trade-off, architectural solutions overview.
LSTM Architecture
Coming Soon. Covers cell state as information highway, gate mechanism intuition, LSTM diagram walkthrough, information flow in LSTMs, LSTM for long sequences, LSTM memory capacity.
LSTM Gate Equations
Coming Soon. Covers forget gate equations, input gate equations, cell state update, output gate equations, hidden state computation, LSTM parameter count, implementing LSTM from scratch.
LSTM Gradient Flow
Coming Soon. Covers constant error carousel, forget gate gradient highway, gradient flow analysis, LSTM vs vanilla RNN gradients, peephole connections, LSTM gradient clipping needs.
GRU Architecture
Coming Soon. Covers GRU vs LSTM comparison, reset gate function, update gate function, candidate hidden state, GRU equations, GRU parameter efficiency, when to choose GRU vs LSTM.
Bidirectional RNNs
Coming Soon. Covers forward and backward passes, hidden state concatenation, bidirectional architectures, bidirectionality for classification, limitations for generation, implementing bidirectional RNNs.
Stacked RNNs
Coming Soon. Covers multiple RNN layers, residual connections for depth, layer normalization in RNNs, depth vs width trade-offs, gradient flow in deep RNNs, practical depth limits.
Part IX: Sequence-to-Sequence
7 chapters
Encoder-Decoder Framework
Coming Soon. Covers encoder role and design, decoder role and design, context vector as bottleneck, seq2seq for machine translation, seq2seq for summarization, seq2seq training setup.
Teacher Forcing
Coming Soon. Covers teacher forcing procedure, exposure bias problem, teacher forcing efficiency, scheduled sampling, curriculum learning, teacher forcing vs autoregressive training.
Beam Search
Coming Soon. Covers greedy decoding limitations, beam search algorithm, beam width selection, length normalization, diverse beam search, beam search implementation, beam search vs sampling.
Attention Intuition
Coming Soon. Covers attention as soft lookup, attention weight interpretation, attention for variable-length inputs, attention visualization, attention vs pooling, attention computation overview.
Bahdanau Attention
Coming Soon. Covers alignment model formulation, score function (additive), attention weight computation, context vector as weighted sum, attention in decoder, Bahdanau attention implementation.
Luong Attention
Coming Soon. Covers dot product attention, general (bilinear) attention, concat attention variant, global vs local attention, Luong vs Bahdanau comparison, attention placement (input vs output).
Copy Mechanism
Coming Soon. Covers pointer network motivation, copy probability computation, mixing generation and copying, pointer-generator networks, copy mechanism for summarization, OOV handling with copy.
Part X: Self-Attention
6 chapters
Self-Attention Concept
Coming Soon. Covers cross-attention vs self-attention, self-attention motivation, all-pairs interaction, self-attention for representation learning, self-attention computational pattern.
Query, Key, Value
Coming Soon. Covers QKV intuition (database lookup), projection matrices Wq, Wk, Wv, query-key matching, value retrieval, QKV dimensions and shapes, QKV as learned transformations.
Scaled Dot-Product Attention
Coming Soon. Covers dot product for similarity, softmax for normalization, scaling factor derivation (1/√dk), attention output computation, attention in matrix form, attention implementation.
Attention Masking
Coming Soon. Covers padding masks, causal (look-ahead) masks, combining multiple masks, mask shapes and broadcasting, efficient masking implementation, custom attention patterns.
Multi-Head Attention
Coming Soon. Covers multiple attention heads motivation, head dimension splitting, parallel attention computation, output concatenation and projection, head specialization, multi-head vs single head.
Attention Complexity
Coming Soon. Covers O(n²) attention complexity, memory requirements, attention bottleneck in long sequences, FLOPs computation, attention vs RNN complexity, practical scaling limits.
Part XI: Positional Encoding
7 chapters
Position Problem
Coming Soon. Covers transformer position blindness, why position matters for language, position information requirements, position encoding vs position embedding, absolute vs relative position.
Sinusoidal Position Encoding
Coming Soon. Covers sinusoidal formula derivation, wavelength intuition, position encoding visualization, extrapolation properties, sinusoidal encoding implementation, learned vs sinusoidal trade-offs.
Learned Position Embeddings
Coming Soon. Covers position embedding table, position embedding training, maximum sequence length, learned embedding extrapolation, position embedding analysis, GPT-style position embeddings.
Relative Position Encoding
Coming Soon. Covers relative position motivation, relative attention formulation, clipping relative positions, relative position in self-attention, Shaw et al. relative positions, relative bias implementation.
Rotary Position Embedding (RoPE)
Coming Soon. Covers RoPE intuition, rotation matrix formulation, RoPE in complex numbers, relative position through rotation, RoPE implementation, RoPE frequency patterns.
ALiBi
Coming Soon. Covers ALiBi motivation, linear bias by distance, head-specific slopes, ALiBi extrapolation properties, ALiBi simplicity advantages, ALiBi vs RoPE comparison.
Position Encoding Comparison
Coming Soon. Covers extrapolation benchmarks, training efficiency comparison, implementation complexity, position encoding for long context, hybrid approaches, current best practices.
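Of the schemes compared in this part, the original sinusoidal encoding is the easiest to generate directly. A minimal sketch of the formula from the original transformer paper, with toy sizes:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Each dimension pair oscillates at a different wavelength, which is what lets the model distinguish positions and, in principle, generalize beyond the lengths seen in training.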
Part XII: Transformer Blocks
8 chapters
Residual Connections
Coming Soon. Covers residual connection formulation, gradient highway interpretation, residual scaling, residual connections in transformers, pre-norm vs post-norm residuals.
Layer Normalization
Coming Soon. Covers layer norm vs batch norm, layer norm formula, learnable affine parameters, layer norm placement, layer norm gradient flow, layer norm implementation.
RMSNorm
Coming Soon. Covers RMSNorm derivation, removing mean centering, RMSNorm efficiency, RMSNorm vs LayerNorm performance, RMSNorm in modern architectures.
Pre-Norm vs Post-Norm
Coming Soon. Covers original transformer (post-norm), pre-norm formulation, training stability comparison, gradient flow differences, when to use each, modern consensus.
Feed-Forward Networks
Coming Soon. Covers FFN architecture, hidden dimension expansion, FFN as two linear layers, position independence, FFN parameter count, FFN computational cost.
FFN Activation Functions
Coming Soon. Covers ReLU in original transformer, GELU adoption, GELU approximations, SiLU/Swish in modern models, activation function comparison.
Gated Linear Units
Coming Soon. Covers GLU formulation, gating mechanism, SwiGLU derivation, GeGLU variant, GLU parameter efficiency, GLU in modern architectures.
Transformer Block Assembly
Coming Soon. Covers standard block structure, component ordering, block implementation, block initialization, forward pass walkthrough, block hyperparameters.
Part XIII: Transformer Architectures
6 chapters
Encoder Architecture
Coming Soon. Covers encoder-only design, bidirectional self-attention, encoder for understanding tasks, encoder output usage, BERT-style encoder, encoder layer stacking.
Decoder Architecture
Coming Soon. Covers decoder-only design, causal masking requirement, autoregressive generation, decoder for generation tasks, GPT-style decoder, decoder layer stacking.
Encoder-Decoder Architecture
Coming Soon. Covers encoder-decoder interaction, cross-attention mechanism, encoder-decoder for seq2seq, T5-style architecture, information flow, when to use encoder-decoder.
Cross-Attention
Coming Soon. Covers cross-attention formulation, KV from encoder, Q from decoder, cross-attention masking, cross-attention placement, cross-attention implementation.
Weight Tying
Coming Soon. Covers input-output embedding tying, encoder-decoder tying, parameter reduction, weight tying effects on training, when to tie weights.
Architecture Hyperparameters
Coming Soon. Covers depth vs width trade-offs, number of heads selection, hidden dimension ratios, FFN expansion ratio, total parameter calculation, architecture search.
Part XIV: Efficient Attention
9 chapters
Quadratic Attention Bottleneck
Coming Soon. Covers O(n²) memory analysis, O(n²) compute analysis, attention matrix size, practical sequence limits, bottleneck visualization, motivation for efficiency.
Sparse Attention Patterns
Coming Soon. Covers local attention windows, strided attention patterns, block-sparse attention, combining sparse patterns, sparse attention implementation.
Sliding Window Attention
Coming Soon. Covers sliding window formulation, window size selection, dilated sliding windows, sliding window for long sequences, Mistral-style windowed attention.
Global Tokens
Coming Soon. Covers CLS token global attention, learned global tokens, global-local attention mixing, global token count, implementation strategies.
Longformer
Coming Soon. Covers Longformer attention pattern, global attention configuration, Longformer complexity, Longformer for documents, Longformer implementation.
BigBird
Coming Soon. Covers BigBird attention pattern, random attention benefits, BigBird theoretical guarantees, BigBird vs Longformer, BigBird applications.
Linear Attention
Coming Soon. Covers softmax attention reformulation, kernel feature maps, linear complexity attention, linear attention limitations, Performer and variants.
FlashAttention Algorithm
Coming Soon. Covers GPU memory hierarchy, tiling for SRAM, online softmax computation, recomputation strategy, FlashAttention complexity, FlashAttention benefits.
FlashAttention Implementation
Coming Soon. Covers CUDA kernel basics, memory access patterns, FlashAttention-2 improvements, using FlashAttention in PyTorch, FlashAttention limitations.
Part XV: Long Context
7 chapters
Context Length Challenges
Coming Soon. Covers training sequence length limits, attention memory scaling, position encoding extrapolation, long-range dependency learning, evaluation challenges.
Position Interpolation
Coming Soon. Covers linear position scaling, interpolation vs extrapolation, position interpolation implementation, fine-tuning for longer context, interpolation limitations.
NTK-aware Scaling
Coming Soon. Covers RoPE frequency analysis, high-frequency preservation, NTK-aware formula, dynamic NTK scaling, NTK vs linear interpolation.
YaRN
Coming Soon. Covers YaRN motivation, attention scaling factor, YaRN formula, YaRN training requirements, YaRN vs alternatives.
Attention Sinks
Coming Soon. Covers attention sink phenomenon, StreamingLLM approach, sink token design, streaming inference, infinite context generation.
Memory Augmentation
Coming Soon. Covers memory network concepts, memory retrieval mechanisms, memory writing and updating, memory-augmented transformers, Memorizing Transformers.
Recurrent Memory
Coming Soon. Covers Transformer-XL approach, segment-level processing, recurrent state passing, relative position in recurrence, recurrent memory limitations.
Part XVI: Pre-training Objectives
7 chapters
Causal Language Modeling
Coming Soon. Covers CLM objective formulation, autoregressive factorization, CLM loss computation, CLM for generation, CLM training data, CLM scaling properties.
Masked Language Modeling
Coming Soon. Covers MLM objective formulation, masking strategies (15% rule), [MASK] token usage, MLM for understanding, MLM training dynamics.
Whole Word Masking
Coming Soon. Covers subword masking problems, whole word masking procedure, WWM implementation, WWM vs random masking, WWM for different tokenizers.
Span Corruption
Coming Soon. Covers span selection strategies, span length distribution, sentinel tokens, T5-style corruption, span corruption benefits.
Prefix Language Modeling
Coming Soon. Covers prefix LM formulation, prefix LM attention pattern, prefix LM for generation, prefix LM training, UniLM-style objectives.
Replaced Token Detection
Coming Soon. Covers generator-discriminator setup, replaced vs original detection, RTD efficiency advantages, ELECTRA training procedure, RTD vs MLM comparison.
Denoising Objectives
Coming Soon. Covers token deletion, token shuffling, sentence permutation, document rotation, BART-style denoising, combining denoising tasks.
Part XVII: BERT and Variants
8 chapters
BERT Architecture
Coming Soon. Covers BERT model sizes, BERT layer configuration, BERT embedding layers, BERT attention patterns, BERT output representations.
BERT Pre-training
Coming Soon. Covers pre-training data preparation, MLM implementation, NSP task design, pre-training hyperparameters, pre-training duration.
BERT Fine-tuning
Coming Soon. Covers classification fine-tuning, sequence labeling fine-tuning, question answering fine-tuning, fine-tuning hyperparameters, catastrophic forgetting.
BERT Representations
Coming Soon. Covers [CLS] token usage, layer selection strategies, pooling strategies, BERT as feature extractor, frozen vs fine-tuned representations.
RoBERTa
Coming Soon. Covers dynamic masking, NSP removal, larger batches, more data, RoBERTa training recipe, RoBERTa vs BERT performance.
ALBERT
Coming Soon. Covers factorized embeddings, cross-layer parameter sharing, sentence order prediction, ALBERT efficiency, ALBERT performance trade-offs.
ELECTRA
Coming Soon. Covers generator training, discriminator training, RTD objective, ELECTRA sample efficiency, ELECTRA scaling, ELECTRA fine-tuning.
DeBERTa
Coming Soon. Covers disentangled attention formulation, enhanced mask decoder, DeBERTa position encoding, DeBERTa improvements, DeBERTa-v3 advances.
Part XVIII: GPT Architecture
10 chapters
GPT-1
Coming Soon. Covers GPT-1 architecture, GPT-1 pre-training, GPT-1 fine-tuning approach, GPT-1 transfer learning, GPT-1 historical significance.
GPT-2
Coming Soon. Covers GPT-2 model sizes, GPT-2 architectural changes, zero-shot task performance, GPT-2 training data (WebText), GPT-2 generation quality.
GPT-3
Coming Soon. Covers GPT-3 scale (175B), few-shot prompting discovery, in-context learning analysis, GPT-3 capabilities, GPT-3 limitations.
In-Context Learning
Coming Soon. Covers ICL phenomenon, ICL vs fine-tuning, example selection strategies, ICL scaling behavior, ICL theoretical understanding.
Autoregressive Generation
Coming Soon. Covers generation procedure, KV caching for efficiency, generation stopping criteria, generation speed optimization, generation code implementation.
Decoding Temperature
Coming Soon. Covers temperature scaling, temperature effects on distribution, temperature selection guidelines, temperature vs quality trade-off.
Top-k Sampling
Coming Soon. Covers top-k truncation, k selection strategies, top-k limitations, top-k implementation, combining with temperature.
Nucleus Sampling
Coming Soon. Covers top-p formulation, cumulative probability threshold, nucleus sampling benefits, p selection guidelines, nucleus vs top-k.
Repetition Penalties
Coming Soon. Covers repetition in generation, repetition penalty formulation, frequency penalty, presence penalty, n-gram blocking.
Constrained Decoding
Coming Soon. Covers grammar-guided generation, JSON schema constraints, regex constraints, constrained beam search, constrained sampling.
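The decoding strategies in these chapters compose naturally: temperature rescales the distribution, then top-p (nucleus) sampling truncates it to the smallest set of tokens whose probability mass exceeds the threshold. A minimal sketch over hypothetical toy logits:

```python
import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id: temperature-scale logits, keep the nucleus, renormalize."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()                       # softmax over the vocabulary
    order = np.argsort(probs)[::-1]            # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1   # smallest prefix exceeding top_p
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()        # renormalize within the nucleus
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])   # toy 5-token vocabulary
token = sample_top_p(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng(0))
print(token)
```

Lowering temperature or top_p makes the choice more deterministic; at very small top_p this degenerates to greedy decoding, a trade-off the Nucleus Sampling chapter analyzes.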
Part XIX: Modern Decoder Models
7 chapters
LLaMA Architecture
Coming Soon. Covers LLaMA design philosophy, LLaMA architectural choices, LLaMA training data, LLaMA efficiency, LLaMA significance.
LLaMA Components
Coming Soon. Covers pre-norm with RMSNorm, SwiGLU FFN, RoPE implementation, component interactions, implementation details.
Grouped Query Attention
Coming Soon. Covers GQA motivation, GQA formulation, KV head grouping, GQA memory savings, GQA vs MHA performance, GQA implementation.
Multi-Query Attention
Coming Soon. Covers MQA extreme sharing, MQA memory benefits, MQA quality trade-offs, MQA for inference, MQA vs GQA.
Mistral Architecture
Coming Soon. Covers Mistral design choices, sliding window attention, Mistral efficiency, Mistral performance, Mistral vs LLaMA.
Qwen Architecture
Coming Soon. Covers Qwen architectural choices, Qwen training approach, Qwen multilingual capabilities, Qwen variants.
Phi Models
Coming Soon. Covers Phi design philosophy, textbook-quality data, Phi training approach, Phi efficiency, small model capabilities.
Part XX: Encoder-Decoder Models
6 chapters
T5 Architecture
Covers T5 encoder-decoder design, T5 attention patterns, T5 model sizes, T5 relative positions, T5 implementation.
T5 Pre-training
Covers span corruption procedure, sentinel tokens, corruption rate, T5 pre-training data, T5 training scale.
T5 Task Formatting
Covers task prefixes, classification as generation, NER as generation, QA as generation, task formatting examples.
BART Architecture
Covers BART encoder-decoder, BART attention configuration, BART vs T5 comparison, BART model sizes.
BART Pre-training
Covers token masking, token deletion, text infilling, sentence permutation, document rotation, objective combinations.
mT5
Covers mT5 training data, language sampling, cross-lingual transfer, mT5 vs T5 performance, multilingual tokenization.
Part XXI: Scaling Laws
7 chapters
Power Laws in Deep Learning
Covers power law definition, log-log linear relationships, power law fitting, power law universality, power law intuition.
Kaplan Scaling Laws
Covers loss vs parameters, loss vs data, loss vs compute, Kaplan optimal allocation, Kaplan predictions.
Chinchilla Scaling Laws
Covers Chinchilla experiments, revised scaling coefficients, optimal tokens per parameter, Chinchilla vs Kaplan, Chinchilla implications.
Compute-Optimal Training
Covers compute budget allocation, tokens vs parameters ratio, training efficiency, compute-optimal recipes, practical guidelines.
Data-Constrained Scaling
Covers data repetition effects, optimal repetition strategies, data augmentation scaling, synthetic data scaling.
Inference Scaling
Covers training vs inference compute, inference-optimal models, over-training for efficiency, deployment cost modeling.
Predicting Model Performance
Covers loss extrapolation, capability prediction, scaling law uncertainty, prediction reliability, practical forecasting.
Part XXII: Emergent Capabilities
6 chapters
Emergence in Neural Networks
Covers emergence definition, phase transitions, emergence examples, emergence mechanisms, emergence debate.
In-Context Learning Emergence
Covers ICL emergence curves, ICL vs fine-tuning scaling, ICL mechanism hypotheses, ICL as meta-learning.
Chain-of-Thought Emergence
Covers CoT emergence observations, CoT elicitation, CoT scaling behavior, CoT mechanism theories.
Emergence vs Metrics
Covers discontinuous metrics, accuracy threshold effects, smooth underlying capabilities, re-examining emergence claims.
Inverse Scaling
Covers inverse scaling phenomena, distractor tasks, sycophancy scaling, inverse scaling prize findings.
Grokking
Covers grokking phenomenon, grokking in arithmetic, grokking mechanism theories, grokking phase transitions, practical implications.
Part XXIII: Mixture of Experts
10 chapters
Sparse Models
Covers dense vs sparse trade-offs, conditional computation motivation, sparse model efficiency, sparse model challenges.
Expert Networks
Covers expert architecture, expert as FFN, expert capacity, expert count selection, expert placement in transformer.
Gating Networks
Covers router architecture, routing score computation, router training, router learned behavior.
Top-K Routing
Covers top-1 routing, top-2 routing, k selection trade-offs, routing implementation, combining expert outputs.
Load Balancing
Covers expert utilization imbalance, collapse failure mode, load metrics, balanced routing importance.
Auxiliary Balancing Loss
Covers load balancing loss formulation, loss coefficient tuning, balancing vs task loss, auxiliary loss implementation.
Router Z-Loss
Covers router instability, z-loss formulation, z-loss benefits, z-loss coefficient, combined auxiliary losses.
Expert Parallelism
Covers expert placement strategies, all-to-all communication, communication overhead, expert parallelism implementation.
Switch Transformer
Covers Switch Transformer design, top-1 routing choice, capacity factor, Switch scaling results.
Mixtral
Covers Mixtral architecture, Mixtral expert design, Mixtral performance, Mixtral efficiency, Mixtral vs dense models.
Part XXIV: Fine-tuning Fundamentals
5 chapters
Transfer Learning
Covers transfer learning paradigm, pre-training/fine-tuning split, what transfers, transfer learning efficiency.
Full Fine-tuning
Covers full fine-tuning procedure, fine-tuning hyperparameters, learning rate selection, batch size effects.
Catastrophic Forgetting
Covers forgetting phenomenon, forgetting measurement, forgetting mitigation, pre-trained capability preservation.
Fine-tuning Learning Rates
Covers discriminative fine-tuning, layer-wise learning rates, warmup for fine-tuning, learning rate decay.
Fine-tuning Data Efficiency
Covers few-shot fine-tuning, data augmentation, sample efficiency patterns, small data strategies.
Part XXV: Parameter-Efficient Fine-tuning
12 chapters
PEFT Motivation
Covers parameter storage costs, multi-task deployment, PEFT efficiency, PEFT quality trade-offs.
LoRA Concept
Covers weight update decomposition, low-rank assumption, LoRA efficiency gains, LoRA flexibility.
LoRA Mathematics
Covers LoRA formulation W + BA, rank selection, initialization scheme, LoRA gradient computation.
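The W + BA formulation is compact enough to preview. The sizes and alpha/rank scaling below are illustrative; note that B starts at zero, so the adapted model initially matches the frozen base model exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16                  # hidden size, LoRA rank, scaling alpha (illustrative)

W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # A: small random initialization
B = np.zeros((d, r))                     # B: zeros, so the update BA starts at 0

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x); only A and B receive gradients during training.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
y = lora_forward(x)
```

The trainable parameter count is 2·d·r rather than d², which is the source of LoRA's storage savings.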
LoRA Implementation
Covers LoRA module design, merging weights, LoRA training loop, LoRA in PyTorch, HuggingFace PEFT usage.
LoRA Hyperparameters
Covers rank selection guidelines, alpha/rank ratio, which layers to adapt, LoRA dropout.
QLoRA
Covers 4-bit quantization for base model, NF4 data type, double quantization, QLoRA memory savings.
AdaLoRA
Covers importance-based pruning, SVD-based adaptation, dynamic rank, AdaLoRA training procedure.
IA3
Covers IA3 formulation, learned rescaling vectors, IA3 parameter efficiency, IA3 vs LoRA.
Prefix Tuning
Covers prefix tuning formulation, prefix length selection, prefix tuning for generation, prefix vs LoRA.
Prompt Tuning
Covers prompt tuning formulation, prompt initialization, prompt tuning scaling, prompt length effects.
Adapter Layers
Covers adapter architecture, adapter placement, adapter dimensionality, adapter fusion.
PEFT Comparison
Covers performance comparison, parameter efficiency comparison, task suitability, practical recommendations.
Part XXVI: Instruction Tuning
6 chapters
Instruction Following
Covers instruction tuning motivation, instruction format design, instruction diversity, instruction quality.
Instruction Data Creation
Covers human annotation, template-based generation, seed task expansion, quality filtering.
Self-Instruct
Covers self-instruct procedure, instruction generation, response generation, filtering strategies.
Instruction Format
Covers prompt templates, system messages, multi-turn format, chat templates, role definitions.
Instruction Tuning Training
Covers instruction tuning data mixing, training hyperparameters, loss masking, multi-task learning.
Instruction Following Evaluation
Covers instruction following benchmarks, human evaluation, automatic evaluation, instruction difficulty.
Part XXVII: Alignment and RLHF
16 chapters
Alignment Problem
Covers alignment definition, helpfulness vs harmlessness, alignment challenges, alignment approaches overview.
Human Preference Data
Covers preference collection UI, comparison design, annotator guidelines, preference data quality.
Bradley-Terry Model
Covers pairwise comparison model, preference probability, Bradley-Terry likelihood, preference strength.
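The pairwise model reduces to a logistic function of the reward gap, which a one-liner can preview (reward values are illustrative):

```python
import math

def preference_probability(reward_a, reward_b):
    # Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b).
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Equal rewards give a coin flip; a 2-point gap gives ~88% preference for a.
p_equal = preference_probability(1.0, 1.0)
p_gap = preference_probability(2.0, 0.0)
```

This is the likelihood that reward-model training maximizes over human comparison data.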
Reward Modeling
Covers reward model architecture, preference loss function, reward model training, reward model evaluation.
Reward Hacking
Covers reward hacking examples, distribution shift, over-optimization, reward hacking mitigation.
Policy Gradient Methods
Covers policy definition, REINFORCE algorithm, policy gradient derivation, variance reduction.
PPO Algorithm
Covers clipped objective, PPO derivation, trust region intuition, PPO implementation.
PPO for Language Models
Covers LLM as policy, action space (tokens), reward assignment, KL penalty importance.
RLHF Pipeline
Covers SFT stage, reward model training, PPO fine-tuning, RLHF hyperparameters, RLHF debugging.
KL Divergence Penalty
Covers KL penalty motivation, KL coefficient selection, adaptive KL, KL effects on training.
DPO Concept
Covers DPO motivation, removing reward model, DPO intuition, DPO benefits.
DPO Derivation
Covers DPO from RLHF objective, optimal policy derivation, DPO loss function, DPO as classification.
DPO Implementation
Covers DPO data format, DPO loss computation, DPO training procedure, DPO hyperparameters.
DPO Variants
Covers IPO formulation, KTO for unpaired feedback, ORPO, cDPO, comparing alignment methods.
RLAIF
Covers AI as annotator, constitutional AI principles, AI preference generation, RLAIF scalability.
Iterative Alignment
Covers iterative DPO, online preference learning, self-improvement loops, alignment stability.
Part XXVIII: Inference Optimization
14 chapters
KV Cache
Covers KV cache motivation, cache structure, cache memory requirements, cache management.
KV Cache Memory
Covers cache size calculation, batch size effects, sequence length effects, memory bottleneck.
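The cache-size arithmetic can be previewed directly. The factor of 2 accounts for storing both a K and a V vector per layer, per KV head, per position; the 7B-class configuration below is illustrative, not a specific model's spec:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # One K and one V vector cached per layer, per KV head, per sequence position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative config: 32 layers, 32 KV heads of dim 128, fp16 cache, 4K context.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
```

For this configuration a single 4K-token sequence costs 2 GiB of cache, which is why batch size and context length, not weights, often dominate inference memory.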
Paged Attention
Covers memory fragmentation problem, page-based allocation, vLLM approach, paged attention benefits.
KV Cache Compression
Covers cache eviction strategies, attention sink preservation, H2O algorithm, cache quantization.
Weight Quantization Basics
Covers quantization fundamentals, per-tensor vs per-channel, symmetric vs asymmetric, calibration.
INT8 Quantization
Covers INT8 range mapping, absmax quantization, SmoothQuant, INT8 accuracy.
INT4 Quantization
Covers 4-bit challenges, group-wise quantization, 4-bit accuracy trade-offs, 4-bit formats.
GPTQ
Covers GPTQ algorithm, layer-wise quantization, Hessian approximation, GPTQ implementation.
AWQ
Covers salient weight preservation, AWQ algorithm, AWQ vs GPTQ, AWQ benefits.
GGUF Format
Covers GGML/GGUF history, quantization types, GGUF file format, llama.cpp integration.
Speculative Decoding
Covers speculative decoding concept, draft model selection, verification procedure, acceptance rate.
Speculative Decoding Math
Covers acceptance criterion, expected speedup, draft quality effects, optimal draft length.
Continuous Batching
Covers static vs continuous batching, iteration-level scheduling, request completion handling, throughput gains.
Inference Serving
Covers inference server architecture, request routing, load balancing, auto-scaling, latency optimization.
Part XXIX: Retrieval-Augmented Generation
14 chapters
RAG Motivation
Covers knowledge limitations, parametric vs non-parametric, RAG benefits, RAG use cases.
RAG Architecture
Covers retriever component, generator component, retrieval timing, architecture variations.
Dense Retrieval
Covers bi-encoder architecture, embedding similarity, dense vs sparse retrieval, dense retrieval training.
Contrastive Learning for Retrieval
Covers contrastive loss, in-batch negatives, hard negative mining, DPR training procedure.
Document Chunking
Covers chunking strategies, chunk size selection, overlap handling, semantic chunking.
Embedding Models
Covers embedding model architectures, pooling strategies, embedding dimensions, embedding model selection.
Vector Similarity Search
Covers distance metrics, exact vs approximate search, complexity trade-offs, similarity search libraries.
HNSW Index
Covers HNSW algorithm, graph construction, search procedure, HNSW parameters.
IVF Index
Covers clustering approach, probe count, IVF-PQ combination, IVF vs HNSW.
Product Quantization
Covers PQ algorithm, codebook learning, PQ accuracy trade-offs, PQ for scale.
Hybrid Search
Covers BM25 + dense fusion, reciprocal rank fusion, weighted combination, hybrid benefits.
Reranking
Covers cross-encoder architecture, reranking procedure, reranker training, reranker latency.
RAG Prompt Engineering
Covers context placement, citation formats, context truncation, instruction design.
RAG Evaluation
Covers retrieval metrics, generation metrics, end-to-end evaluation, RAGAS framework.
Part XXX: Tool Use and Agents
7 chapters
Tool Use Motivation
Covers LLM limitations, tool augmentation, tool use examples, tool use benefits.
Function Calling
Covers function schema definition, function call generation, function output handling, function calling fine-tuning.
ReAct Pattern
Covers ReAct formulation, thought-action-observation loop, ReAct prompting, ReAct examples.
Tool Selection
Covers tool descriptions, tool routing, multi-tool scenarios, tool selection training.
Agent Architectures
Covers agent loop design, state management, planning strategies, agent termination.
Agent Memory
Covers short-term memory, long-term memory, memory retrieval, memory summarization.
Agent Evaluation
Covers task completion metrics, trajectory evaluation, agent benchmarks, safety evaluation.
Part XXXI: Multimodal Models
8 chapters
Vision Transformer
Covers image patching, patch embeddings, ViT architecture, ViT pre-training.
CLIP
Covers CLIP architecture, CLIP training objective, CLIP zero-shot classification, CLIP embeddings.
Vision Encoders for VLMs
Covers ViT variants for VLMs, SigLIP improvements, image resolution handling, encoder selection.
Vision-Language Projection
Covers linear projection, MLP projection, Q-Former approach, projection training.
LLaVA Architecture
Covers LLaVA design, two-stage training, visual conversation, LLaVA variants.
Flamingo Architecture
Covers cross-attention to images, gated cross-attention, few-shot visual learning, Flamingo training.
Multimodal Training Data
Covers image-text pairs, interleaved documents, visual instruction data, data quality.
Multimodal Evaluation
Covers VQA benchmarks, multimodal understanding benchmarks, multimodal generation evaluation.
Part XXXII: Speech and Audio
5 chapters
Speech Representations
Covers mel spectrograms, mel filterbanks, feature normalization, audio preprocessing.
Whisper Architecture
Covers Whisper encoder-decoder, multitask training, language tokens, timestamp prediction.
Whisper Training
Covers Whisper training data, weak supervision, multilingual training, Whisper capabilities.
Speech-Language Integration
Covers speech encoder + LLM, audio tokens, speech-to-text-to-LLM vs end-to-end, speech LLM architectures.
Text-to-Speech
Covers TTS architecture overview, vocoder role, TTS quality metrics, neural TTS approaches.
Part XXXIII: Evaluation Fundamentals
7 chapters
Perplexity Evaluation
Covers perplexity calculation, perplexity interpretation, perplexity limitations, comparing perplexities.
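The calculation itself fits in a few lines: perplexity is the exponential of the average negative log-likelihood per token. A toy example with a model that guesses uniformly over a 4-word vocabulary:

```python
import math

def perplexity(token_log_probs):
    # PPL = exp(average negative log-likelihood per token).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A uniform guesser over 4 words assigns each token probability 0.25,
# so its perplexity is exactly the vocabulary size: 4.
ppl = perplexity([math.log(0.25)] * 10)
```

This "effective branching factor" reading — the model is as uncertain as a uniform choice among PPL options — is the interpretation the chapter develops.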
Cross-Entropy Loss
Covers cross-entropy definition, bits-per-character, cross-entropy vs perplexity, loss curves.
BLEU Score
Covers n-gram precision, brevity penalty, BLEU formula, BLEU limitations, corpus vs sentence BLEU.
ROUGE Scores
Covers ROUGE-N, ROUGE-L, ROUGE-W, ROUGE interpretation, ROUGE limitations.
BERTScore
Covers BERTScore computation, token alignment, BERTScore variants, BERTScore vs BLEU.
Exact Match and F1
Covers exact match scoring, token-level F1, normalization for matching, metric selection.
Calibration
Covers calibration definition, expected calibration error, calibration plots, calibration methods.
Part XXXIV: Benchmark Evaluation
8 chapters
MMLU
Covers MMLU structure, subject coverage, MMLU evaluation protocol, MMLU limitations.
HellaSwag
Covers HellaSwag task design, adversarial filtering, HellaSwag evaluation, HellaSwag saturation.
GSM8K
Covers GSM8K problem types, chain-of-thought evaluation, GSM8K accuracy metrics, math reasoning assessment.
HumanEval
Covers HumanEval structure, functional correctness, pass@k metric, HumanEval limitations.
MBPP
Covers MBPP dataset, MBPP vs HumanEval, code evaluation challenges.
TruthfulQA
Covers TruthfulQA design, truthfulness vs informativeness, TruthfulQA evaluation methods.
Benchmark Contamination
Covers contamination problem, contamination detection methods, n-gram overlap analysis, contamination mitigation.
Benchmark Saturation
Covers ceiling effects, benchmark retirement, dynamic benchmarks, benchmark evolution.
Part XXXV: Human and Model Evaluation
6 chapters
Human Evaluation Design
Covers evaluation interface design, task instructions, annotator selection, evaluation cost.
Inter-Annotator Agreement
Covers Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, handling disagreement.
Preference Evaluation
Covers A/B comparison design, Elo rating systems, preference aggregation, statistical significance.
LLM-as-Judge
Covers judge prompt design, judge model selection, judge calibration, judge limitations.
Position Bias in LLM Judges
Covers position bias measurement, bias mitigation (swapping), verbosity bias, sycophancy.
Evaluation Prompt Engineering
Covers prompt sensitivity, evaluation prompt design, few-shot vs zero-shot evaluation, evaluation consistency.
Part XXXVI: Bias and Fairness
5 chapters
Bias in Language Models
Covers bias sources, bias types (demographic, cultural), bias in training data, bias amplification.
Bias Measurement
Covers embedding association tests, generation bias metrics, classification bias metrics, bias benchmarks.
Bias Mitigation
Covers data balancing, fine-tuning for fairness, prompt-based mitigation, debiasing embeddings.
Fairness Metrics
Covers demographic parity, equalized odds, fairness trade-offs, choosing fairness metrics.
Representation Harms
Covers stereotyping, erasure, demeaning associations, measuring representation harms.
Part XXXVII: Hallucination and Factuality
6 chapters
Hallucination Types
Covers intrinsic vs extrinsic hallucination, factual errors, fabrication, inconsistency.
Hallucination Detection
Covers entailment-based detection, knowledge base verification, self-consistency checks, detection models.
Hallucination Causes
Covers training data issues, exposure bias, knowledge gaps, generation pressure.
Hallucination Mitigation
Covers retrieval augmentation, decoding strategies, training approaches, uncertainty expression.
Attribution and Citation
Covers inline citation, attribution accuracy, source verification, attribution evaluation.
Uncertainty Quantification
Covers confidence calibration, verbalized uncertainty, sampling-based uncertainty, uncertainty communication.
Part XXXVIII: Safety and Security
8 chapters
Safety Risks
Covers harmful content generation, misuse scenarios, unintended harms, safety threat models.
Red Teaming
Covers red team methodology, attack taxonomies, red team findings, red team automation.
Jailbreaking
Covers jailbreak techniques, prompt injection, adversarial suffixes, jailbreak defenses.
Prompt Injection
Covers direct prompt injection, indirect prompt injection, injection in RAG, injection defenses.
Content Filtering
Covers classification-based filtering, rule-based filtering, filter placement, filter evaluation.
Guardrails
Covers input guardrails, output guardrails, guardrail frameworks, guardrail design.
Memorization and Privacy
Covers memorization measurement, extractable memorization, PII in training data, privacy risks.
Differential Privacy
Covers DP-SGD basics, privacy budget, DP accuracy trade-offs, DP for LLMs.
Part XXXIX: Interpretability
11 chapters
Interpretability Goals
Covers debugging, trust, safety, scientific understanding, interpretability approaches overview.
Attention Visualization
Covers attention weight extraction, attention head visualization, attention interpretation caveats, attention tools.
Attention Analysis Limitations
Covers attention vs importance, attention manipulation studies, gradient-based alternatives.
Probing Classifiers
Covers linear probing methodology, probing task design, probing interpretation, control tasks.
Probing Layers
Covers layer selection, representation evolution, task localization, layer probing patterns.
Activation Patching
Covers patching methodology, locating information, patching experiments, causal tracing.
Logit Lens
Covers logit lens concept, intermediate vocabulary projection, tuned lens, lens interpretation.
Sparse Autoencoders
Covers SAE architecture, sparsity constraints, dictionary learning, SAE for LLMs.
Feature Interpretation
Covers feature activation patterns, feature naming, automated interpretation, feature circuits.
Mechanistic Interpretability
Covers circuit analysis, algorithmic tasks, induction heads, mechanistic discoveries.
Activation Steering
Covers steering vectors, activation addition, representation engineering, steering applications.
Part XL: Data Curation
10 chapters
Web Crawling
Covers Common Crawl, crawling strategies, robots.txt respect, crawl freshness.
Document Extraction
Covers HTML parsing, boilerplate removal, content extraction, trafilatura and similar tools.
Language Identification
Covers language ID models, multilingual document handling, code-switching, language filtering.
Deduplication
Covers exact deduplication, near-duplicate detection, document vs substring dedup, dedup at scale.
MinHash
Covers MinHash algorithm, Jaccard similarity estimation, MinHash LSH, MinHash implementation.
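The estimation idea previews well in plain Python: the fraction of matching minimum-hash slots between two signatures approximates the Jaccard similarity of the underlying token sets. The XOR-mask hash family below is a simplification for illustration, not a production MinHash implementation:

```python
import random

def minhash_signature(tokens, num_hashes=128, seed=0):
    # XOR with a random 64-bit mask stands in for a family of hash
    # functions (a sketch; real systems use stronger hash families).
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(t) ^ m for t in tokens) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"the", "cat", "sat", "on", "mat"})
b = minhash_signature({"the", "cat", "sat", "on", "hat"})
similarity = estimated_jaccard(a, b)
```

Here the true Jaccard similarity is 4/6 ≈ 0.67, and the 128-slot signature estimate clusters around that value — which is what makes signatures, rather than full documents, sufficient for near-duplicate detection at scale.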
Quality Filtering
Covers heuristic filters, perplexity filtering, classifier-based filtering, filter thresholds.
Toxicity Filtering
Covers toxicity classifiers, toxicity thresholds, over-filtering risks, toxicity filter evaluation.
PII Removal
Covers PII detection methods, PII removal strategies, PII removal evaluation, privacy preservation.
Data Mixing
Covers domain proportions, quality weighting, data mixing experiments, optimal mixing.
Synthetic Data
Covers synthetic data generation, quality verification, synthetic data diversity, distillation.
Part XLI: Training Infrastructure
11 chapters
GPU Architecture
Covers GPU memory hierarchy, CUDA cores, tensor cores, GPU specifications.
Memory Management
Covers memory breakdown (activations, parameters, gradients, optimizer states), memory estimation, OOM debugging.
Data Parallelism
Covers DDP algorithm, gradient synchronization, all-reduce operations, DDP scaling.
Tensor Parallelism
Covers column parallelism, row parallelism, communication patterns, Megatron-style parallelism.
Pipeline Parallelism
Covers pipeline stages, micro-batching, pipeline bubbles, pipeline schedules (GPipe, 1F1B).
ZeRO Optimization
Covers ZeRO stage 1 (optimizer state partitioning), ZeRO stage 2 (gradient partitioning), ZeRO stage 3 (parameter partitioning), ZeRO memory savings.
FSDP
Covers FSDP concepts, FSDP vs ZeRO, FSDP sharding strategies, FSDP usage.
Activation Checkpointing
Covers checkpointing concept, checkpoint selection, checkpointing overhead, selective checkpointing.
Mixed Precision Training
Covers floating point formats, loss scaling, BF16 advantages, mixed precision implementation.
Communication Optimization
Covers gradient compression, communication overlap, topology-aware communication, NCCL optimization.
Checkpointing and Recovery
Covers checkpoint contents, checkpoint frequency, async checkpointing, fault recovery.
Part XLII: Training Optimization
8 chapters
Learning Rate Warmup
Covers warmup motivation, linear warmup, warmup duration, warmup for large batches.
Learning Rate Decay
Covers step decay, exponential decay, inverse square root decay, decay scheduling.
Cosine Learning Rate Schedule
Covers cosine decay formula, cosine with restarts, cosine schedule parameters, cosine vs linear.
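The decay formula previews concisely. A sketch combining linear warmup with cosine decay (the step counts and learning rates are illustrative):

```python
import math

def cosine_lr(step, max_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    # Linear warmup to peak_lr, then cosine decay toward min_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, max_steps=100, peak_lr=1e-3, warmup_steps=10) for s in range(100)]
```

The schedule rises linearly for the first 10 steps, peaks at 1e-3, then follows the half-cosine down toward zero — slow at the start and end of decay, fastest in the middle.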
Large Batch Training
Covers batch size effects, learning rate scaling, batch size limits, LAMB optimizer.
Weight Decay
Covers weight decay formula, decoupled weight decay, weight decay selection, weight decay interaction with Adam.
Gradient Accumulation
Covers accumulation procedure, accumulation steps, accumulation for memory, accumulation correctness.
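The correctness point — that summing micro-batch gradients scaled by the number of accumulation steps reproduces the full-batch gradient — can be checked numerically on a toy mean-squared-error loss:

```python
def grad(w, xs, ys):
    # dL/dw for L = mean((w*x - y)^2).
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_batch = grad(w, xs, ys)

# Accumulate over two micro-batches of size 2: divide each micro-batch
# gradient (equivalently, scale its loss) by the number of accumulation steps.
steps = 2
accumulated = 0.0
for i in range(0, len(xs), 2):
    accumulated += grad(w, xs[i:i + 2], ys[i:i + 2]) / steps
```

The two quantities match exactly when micro-batches are equal-sized, which is why accumulation trades memory for time without changing the optimization trajectory.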
Training Stability
Covers loss spikes, gradient norm monitoring, stability techniques, training stability debugging.
Hyperparameter Selection
Covers hyperparameter search, hyperparameter transfer, critical vs robust hyperparameters, default recipes.
Part XLIII: Code Generation
6 chapters
Code LLM Training
Covers code training data, code tokenization, fill-in-the-middle training, code pre-training objectives.
Code Understanding
Covers code explanation, bug detection, code review, code search.
Code Completion
Covers completion context, completion ranking, completion latency, completion UX.
Code Generation
Covers docstring-to-code, test-to-code, code generation strategies, generation quality.
Code Execution
Covers sandboxed execution, execution feedback, iterative refinement, execution safety.
Code Evaluation
Covers functional correctness, pass@k metric, code benchmarks, beyond correctness.
Part XLIV: Production Systems
9 chapters
Model Serving
Covers serving frameworks, model loading, request handling, serving configuration.
Latency Optimization
Covers latency breakdown, batching latency, streaming responses, latency monitoring.
Throughput Optimization
Covers batch size tuning, GPU utilization, concurrent requests, throughput measurement.
Auto-scaling
Covers scaling metrics, horizontal scaling, scale-up vs scale-out, scaling policies.
Model Routing
Covers model selection, A/B testing, model cascades, routing strategies.
Caching
Covers prompt caching, semantic caching, cache invalidation, cache hit rates.
Monitoring
Covers metrics collection, alerting, logging, dashboards.
Quality Monitoring
Covers output quality metrics, drift detection, regression detection, quality alerts.
Cost Management
Covers cost modeling, cost optimization, cost allocation, cost monitoring.
Part XLV: Continual Learning
5 chapters
Continual Learning Problem
Covers continual learning definition, catastrophic forgetting, continual learning scenarios.
Regularization Methods
Covers elastic weight consolidation, synaptic intelligence, parameter importance, regularization trade-offs.
Replay Methods
Covers replay buffer design, pseudo-rehearsal, generative replay, replay selection.
Architecture Methods
Covers progressive networks, expert expansion, architecture search, modular approaches.
Continual Learning Evaluation
Covers forward transfer, backward transfer, evaluation protocols, continual benchmarks.
Part XLVI: Model Compression
6 chapters
Knowledge Distillation
Covers distillation objective, temperature in distillation, teacher selection, distillation for LLMs.
Distillation Variants
Covers feature distillation, attention transfer, progressive distillation, on-policy distillation.
Pruning Basics
Covers weight pruning, structured vs unstructured, pruning criteria, pruning schedule.
Structured Pruning
Covers head pruning, layer pruning, width pruning, structured pruning implementation.
Model Merging
Covers weight averaging, task arithmetic, TIES merging, DARE merging.
Model Merging Applications
Covers multi-task merging, style merging, capability composition, merging evaluation.
Part XLVII: Advanced Topics
11 chapters
Constitutional AI
Covers constitutional principles, critique and revision, CAI training, CAI effectiveness.
Process Reward Models
Covers outcome vs process reward, PRM training, PRM for math, PRM limitations.
Test-Time Compute
Covers multiple sampling, self-consistency, iterative refinement, compute-optimal inference.
Chain-of-Thought
Covers CoT prompting, zero-shot CoT, CoT fine-tuning, CoT limitations.
Self-Consistency
Covers self-consistency procedure, sampling diversity, voting strategies, self-consistency effectiveness.
Tree of Thought
Covers ToT framework, thought generation, thought evaluation, ToT search.
Retrieval-Augmented Training
Covers RETRO architecture, retrieval during training, retrieved context integration.
Long-Form Generation
Covers outline-based generation, hierarchical generation, coherence maintenance, long-form evaluation.
Watermarking
Covers watermarking schemes, statistical detection, watermark robustness, watermark evaluation.
Model Cards
Covers model card contents, intended use documentation, limitation documentation, model card best practices.
Responsible Deployment
Covers release decisions, staged release, access control, deployment monitoring.
In Progress
This comprehensive handbook is currently in development. Each chapter will be published as it's completed, with practical examples, code implementations, and real-world applications.
Stay Updated
Get notified when new chapters are published.