Language AI Handbook Cover
In Progress

For

Engineers, researchers, students, AI enthusiasts, linguists, product managers, and anyone interested in understanding or building modern language AI systems, from foundational NLP to advanced large language models.

Language AI Handbook

A Complete Guide to Natural Language Processing and Large Language Models: From Classical NLP and Transformer Architecture to Pre-training, Fine-tuning, and Production Deployment

104h 37m total read time
159 of 380 chapters published

About This Book

Language AI has transformed from an academic curiosity into the defining technology of our era. But beneath the hype of ChatGPT and Claude lies a rich technical landscape that most practitioners only partially understand. This handbook gives you the complete picture, from classical NLP techniques that still matter to the cutting-edge architectures powering today's most capable systems.

Begin with the fundamentals that never go out of style: tokenization, embeddings, and the statistical foundations that inform modern approaches. Then dive deep into the transformer architecture. Learn not just how to use it, but how it actually works. Understand self-attention mathematically, grasp why positional encodings matter, and see how architectural choices like layer normalization affect training dynamics.


What's Inside

Part 1

Text as Data

5 chapters · 2h 51m
Part 2

Classical Text Representations

9 chapters · 6h 1m
Part 3

Distributional Semantics

4 chapters · 2h 45m
Part 4

Word Embeddings

9 chapters · 8h 3m
Part 5

Subword Tokenization

8 chapters · 3h 55m
Part 6

Sequence Labeling

8 chapters · 5h 13m

Table of Contents

Part I: Text as Data

5 chapters

Part II: Classical Text Representations

9 chapters
6

Bag of Words

Covers document-term matrix construction, vocabulary building from corpus, word counting and frequency vectors, sparse matrix representation (CSR/CSC formats), vocabulary pruning (min_df, max_df), binary vs count representations, limitations from losing word order.

33m
7

N-grams

Covers bigram and trigram extraction, n-gram vocabulary explosion, n-gram frequency distributions, Zipf's law in n-grams, character n-grams for robustness, skip-grams and flexible windows, n-gram indexing for search.

23m
8

N-gram Language Models

Covers Markov assumption and chain rule, maximum likelihood estimation, probability calculation for sequences, handling unseen n-grams, start and end tokens, generating text from n-gram models, model storage and lookup efficiency.

42m
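
To make the Markov estimation concrete before diving in, here is a minimal MLE bigram model with start and end tokens — a from-scratch sketch, not the chapter's own code (handling unseen n-grams via smoothing comes shortly after):

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])          # left-context counts
        bigrams.update(zip(toks, toks[1:]))  # adjacent pair counts
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

model = train_bigram(["the cat sat", "the dog sat"])
# every sentence starts with "the", so P(the | <s>) = 1.0;
# "the" is followed by "cat" half the time, so P(cat | the) = 0.5
```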
9

Smoothing Techniques

Covers add-one (Laplace) smoothing, add-k smoothing and tuning, Good-Turing smoothing derivation, Kneser-Ney smoothing intuition and formula, interpolation vs backoff, modified Kneser-Ney, comparing smoothing methods empirically.

36m
10

Perplexity

Covers cross-entropy definition and derivation, perplexity as branching factor, relationship to bits-per-character, held-out evaluation methodology, perplexity vs downstream performance, comparing models with perplexity, perplexity limitations and caveats.

43m
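
The branching-factor view previewed above fits in a few lines: perplexity is the exponential of the average negative log-likelihood per token. An illustrative sketch:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# a model assigning probability 0.25 to each of four tokens is exactly as
# uncertain as a uniform 4-way choice: perplexity 4
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```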
11

Term Frequency

Covers raw term frequency, log-scaled term frequency, boolean term frequency, augmented term frequency, L2-normalized frequency vectors, term frequency sparsity patterns, efficient term frequency computation.

55m
12

Inverse Document Frequency

Covers document frequency calculation, IDF formula derivation, IDF intuition (rare words matter more), smoothed IDF variants, IDF across corpus splits, relationship to information theory, implementing IDF efficiently.

33m
13

TF-IDF

Covers TF-IDF formula and variants, TF-IDF vector computation, TF-IDF normalization options, BM25 as TF-IDF extension, document similarity with TF-IDF, TF-IDF for feature extraction, sklearn TfidfVectorizer deep dive.

53m
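
The pipeline this chapter builds up can be sketched from scratch in a few lines. This toy version (raw counts, smoothed IDF, L2 normalization) is an illustration only — the chapter itself works through sklearn's TfidfVectorizer and further variants:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF with smoothed IDF: tf * (log((1+N)/(1+df)) + 1), L2-normalized."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

# "the" appears in every document, so it gets the lowest weight
vecs = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
```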
14

BM25

Covers BM25 derivation from probabilistic IR, saturation parameter k1, length normalization parameter b, BM25+ and BM25L variants, field-weighted BM25, implementing BM25 scoring, BM25 vs TF-IDF empirically.

43m

Part III: Distributional Semantics

4 chapters

Part IV: Word Embeddings

9 chapters
19

Skip-gram Model

Covers skip-gram architecture diagram, input/output representations, softmax over vocabulary, skip-gram objective function, training data generation, window size hyperparameter, skip-gram vs CBOW intuition.

56m
20

CBOW Model

Covers CBOW architecture, context word averaging, CBOW objective function, CBOW vs skip-gram training speed, CBOW for frequent words, implementing CBOW forward pass, CBOW gradient derivation.

56m
21

Negative Sampling

Covers softmax computational bottleneck, negative sampling objective derivation, sampling distribution (unigram^0.75), number of negatives hyperparameter, negative sampling gradient computation, NCE vs negative sampling, implementing efficient sampling.

50m
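
The unigram^0.75 sampling distribution listed above is easy to sketch; this illustration leans on Python's random.choices rather than the efficient production-grade sampling the chapter covers:

```python
import random

def make_sampler(counts, power=0.75, seed=0):
    """Negative-sampling distribution: P(w) proportional to count(w)^0.75."""
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    rng = random.Random(seed)
    def sample(k):
        return rng.choices(words, weights=weights, k=k)
    return sample

# raising counts to 0.75 flattens the distribution: "the" is 500x more
# frequent than "zygote" here, but only ~106x more likely to be drawn
sample = make_sampler({"the": 1000, "cat": 50, "zygote": 2})
negatives = sample(5)
```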
22

Hierarchical Softmax

Covers binary tree construction (Huffman coding), path probability computation, hierarchical softmax objective, gradient computation along paths, tree structure impact on learning, hierarchical softmax vs negative sampling, when to use each approach.

68m
23

Word2Vec Training

Covers data preprocessing pipeline, subsampling frequent words, learning rate scheduling, minibatch vs online training, convergence monitoring, gensim Word2Vec usage, training from scratch in PyTorch.

42m
24

Word Analogy

Covers vector arithmetic for analogies, parallelogram model, analogy evaluation datasets, 3CosAdd vs 3CosMul methods, analogy accuracy metrics, limitations of analogy evaluation, what analogies reveal about embeddings.

57m
25

GloVe

Covers GloVe objective function derivation, weighted least squares formulation, relationship to matrix factorization, weighting function design, bias terms in GloVe, GloVe vs Word2Vec comparison, training GloVe efficiently.

60m
26

FastText

Covers character n-gram representation, word vector as n-gram sum, FastText architecture, handling OOV words, morphological awareness, FastText for morphologically rich languages, training FastText models.

49m
27

Embedding Evaluation

Covers intrinsic vs extrinsic evaluation, word similarity datasets (SimLex, WordSim), analogy accuracy, embedding visualization (t-SNE, UMAP), downstream task evaluation, embedding bias detection, evaluation pitfalls.

45m

Part V: Subword Tokenization

8 chapters
28

The Vocabulary Problem

Covers OOV word problem, vocabulary size explosion, rare word representation, morphological productivity, compound words, code and technical text, the case for subword units.

26m
29

Byte Pair Encoding

Covers BPE algorithm step-by-step, merge rules learning, vocabulary size control, BPE encoding procedure, BPE decoding procedure, BPE implementation from scratch, BPE hyperparameters.

34m
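
The merge-learning loop at the heart of BPE fits in a short sketch: start from characters, repeatedly merge the most frequent adjacent pair. A toy illustration, not the chapter's full implementation:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                 # adjacent-pair counts, frequency-weighted
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a merge rule
        merges.append(best)
        new_vocab = {}
        for syms, freq in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, 3)
```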
30

WordPiece

Covers WordPiece vs BPE differences, likelihood objective for merges, greedy tokenization algorithm, ## prefix notation, WordPiece in BERT, training WordPiece tokenizers, handling unknown characters.

24m
31

Unigram Language Model Tokenization

Covers unigram LM formulation, EM algorithm for training, Viterbi decoding for tokenization, sampling multiple segmentations, subword regularization, unigram vs BPE comparison, SentencePiece unigram mode.

20m
32

SentencePiece

Covers treating text as raw bytes, whitespace handling (▁ prefix), BPE and unigram modes, training from raw text, pretokenization elimination, SentencePiece in production, multilingual tokenization.

24m
33

Tokenizer Training

Covers corpus preparation, vocabulary size selection, special tokens configuration, training with HuggingFace tokenizers, saving and loading tokenizers, tokenizer versioning, domain-specific tokenizers.

31m
34

Special Tokens

Covers [CLS], [SEP], [PAD], [MASK], [UNK] tokens, beginning/end of sequence tokens, custom special tokens, special token embeddings, token type IDs, handling special tokens in generation.

34m
35

Tokenization Challenges

Covers number tokenization issues, code tokenization, multilingual text mixing, emoji and Unicode edge cases, tokenization artifacts, adversarial tokenization, measuring tokenization quality.

42m

Part VI: Sequence Labeling

8 chapters
36

Part-of-Speech Tagging

Covers POS tag sets (Penn Treebank, Universal), POS tagging as classification, contextual disambiguation, POS tagging accuracy metrics, POS tagging for downstream tasks, rule-based vs statistical taggers.

43m
37

Named Entity Recognition

Covers entity types (PER, ORG, LOC, etc.), NER as sequence labeling, nested entity challenges, entity boundary detection, NER evaluation (exact vs partial match), NER datasets and benchmarks.

34m
38

BIO Tagging

Covers BIO scheme explanation, BIOES/BILOU variants, converting spans to BIO tags, BIO decoding to spans, handling tagging inconsistencies, BIO for multi-label scenarios, implementing BIO utilities.

33m
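
Span-to-BIO conversion and BIO decoding, two of the utilities this chapter builds, can be sketched as follows (an illustrative version that also tolerates a stray I- tag without a preceding B-):

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes a trailing span
        boundary = (tag == "O" or tag.startswith("B-")
                    or (label is not None and tag[2:] != label))
        if boundary and start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # the I- branch tolerates a span that starts without a B- tag
            start, label = i, tag[2:]
    return spans

tags = spans_to_bio(6, [(0, 2, "PER"), (4, 5, "LOC")])
```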
39

Chunking

Covers noun phrase chunking, chunk types (NP, VP, PP), IOB tagging for chunks, chunking vs full parsing, chunking evaluation, chunking as preprocessing, regex chunking with NLTK.

31m
40

Hidden Markov Models

Covers HMM components (states, observations, transitions), emission and transition probabilities, HMM assumptions (Markov, independence), HMM for POS tagging, HMM parameter estimation, HMM limitations for NLP.

33m
41

Viterbi Algorithm

Covers optimal path problem formulation, Viterbi recursion derivation, backpointer tracking, Viterbi complexity analysis, log-space computation, implementing Viterbi efficiently, Viterbi as a foundation for beam search.

47m
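
The recursion and backpointer tracking covered here fit in a short log-space sketch; the toy HMM below (two states, hand-picked probabilities) is purely illustrative:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state path through an HMM, computed in log space."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]  # init column
    back = []
    for o in obs[1:]:
        scores, ptrs = {}, {}
        for s in states:
            # best predecessor for state s at this timestep
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            scores[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptrs[s] = prev
        V.append(scores)
        back.append(ptrs)
    # follow backpointers from the best final state
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

lp = math.log
states = ["N", "V"]
log_start = {"N": lp(0.6), "V": lp(0.4)}
log_trans = {"N": {"N": lp(0.3), "V": lp(0.7)},
             "V": {"N": lp(0.8), "V": lp(0.2)}}
log_emit = {"N": {"dog": lp(0.9), "runs": lp(0.1)},
            "V": {"dog": lp(0.2), "runs": lp(0.8)}}
path = viterbi(["dog", "runs"], states, log_start, log_trans, log_emit)
```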
42

Conditional Random Fields

Covers CRF vs HMM comparison, CRF feature functions, log-linear formulation, partition function computation, CRF for NER, CRF inference complexity, neural CRF layers.

59m
43

CRF Training

Covers CRF log-likelihood objective, forward-backward algorithm, gradient computation, L-BFGS optimization, feature template design, CRF regularization, CRF training convergence.

33m

Part VII: Neural Network Foundations

13 chapters
44

Linear Classifiers

Covers linear decision boundaries, weight vectors and bias, dot product interpretation, multiclass classification (softmax), linear classifier limitations, training with gradient descent.

43m
45

Activation Functions

Covers sigmoid function and saturation, tanh properties, ReLU and dying ReLU, Leaky ReLU and PReLU, ELU and SELU, GELU derivation and properties, Swish and Mish, choosing activation functions.

25m
46

Multilayer Perceptrons

Covers hidden layers and depth, weight matrices between layers, forward pass computation, representational capacity, MLP for classification, MLP for regression, MLP architecture design.

42m
47

Loss Functions

Covers cross-entropy loss derivation, MSE for regression, binary vs multiclass cross-entropy, label smoothing, focal loss for imbalance, loss function numerical stability, custom loss functions.

51m
48

Backpropagation

Covers computational graphs, chain rule review, forward and backward pass, gradient accumulation, backprop complexity analysis, automatic differentiation, implementing backprop from scratch.

71m
49

Stochastic Gradient Descent

Covers batch vs stochastic gradient descent, minibatch gradient descent, learning rate selection, SGD convergence properties, SGD noise as regularization, learning rate schedules basics, SGD implementation.

51m
50

Momentum

Covers momentum intuition (ball rolling), momentum update equations, momentum coefficient selection, dampening oscillations, momentum vs vanilla SGD, Nesterov momentum derivation, implementing momentum.

38m
51

Adam Optimizer

Covers exponential moving averages, first moment (mean) estimation, second moment (variance) estimation, bias correction derivation, Adam update rule, Adam hyperparameters, Adam convergence properties.

51m
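
The update rule and bias correction listed above can be written out for a single scalar parameter — an illustrative sketch, not a production optimizer:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t counts from 1)."""
    m = b1 * m + (1 - b1) * g          # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g      # second moment: running mean of squares
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# thanks to bias correction, the very first step has magnitude ~lr
# regardless of the raw gradient's scale
w, m, v = adam_step(w=0.0, g=100.0, m=0.0, v=0.0, t=1)
```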
52

AdamW

Covers L2 regularization vs weight decay, why they differ with Adam, AdamW formulation, weight decay coefficient selection, AdamW as default optimizer, AdamW vs Adam empirically.

34m
53

Weight Initialization

Covers random initialization importance, Xavier/Glorot initialization derivation, He initialization for ReLU, initialization for different activations, layer-wise initialization, initialization debugging, modern initialization practices.

42m
54

Batch Normalization

Covers internal covariate shift, batch statistics computation, learnable scale and shift, training vs inference mode, batch norm gradient flow, batch norm placement debates, batch norm limitations.

29m
55

Dropout

Covers dropout as ensemble, dropout mask sampling, inverted dropout scaling, dropout rate selection, dropout at inference, spatial dropout for sequences, dropout in modern architectures.

41m
56

Gradient Clipping

Covers gradient explosion detection, clip by value, clip by global norm, gradient clipping implementation, when to use gradient clipping, clipping threshold selection, monitoring gradient norms.

31m
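
Clipping by global norm, one of the techniques listed above, comes down to a single scale factor applied to every gradient; a minimal sketch on plain Python lists:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradient vectors so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads               # within budget: leave gradients untouched
    scale = max_norm / total       # one shared factor preserves direction
    return [[g * scale for g in vec] for vec in grads]

clipped = clip_by_global_norm([[3.0, 4.0]], 1.0)  # global norm 5, rescaled to 1
```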

Part VIII: Recurrent Neural Networks

9 chapters
57

RNN Architecture

Covers recurrent connection intuition, hidden state as memory, unrolled computation graph, parameter sharing across time, RNN for sequence classification, RNN for sequence generation, RNN equations and dimensions.

43m
58

Backpropagation Through Time

Covers BPTT derivation, gradient flow through time, truncated BPTT, BPTT memory requirements, BPTT implementation, gradient accumulation across timesteps.

46m
59

Vanishing Gradients

Covers gradient product across timesteps, vanishing gradient analysis, long-range dependency failure, gradient visualization, vanishing vs exploding trade-off, architectural solutions overview.

39m
60

LSTM Architecture

Covers cell state as information highway, gate mechanism intuition, LSTM diagram walkthrough, information flow in LSTMs, LSTM for long sequences, LSTM memory capacity.

35m
61

LSTM Gate Equations

Covers forget gate equations, input gate equations, cell state update, output gate equations, hidden state computation, LSTM parameter count, implementing LSTM from scratch.

40m
62

LSTM Gradient Flow

Covers constant error carousel, forget gate gradient highway, gradient flow analysis, LSTM vs vanilla RNN gradients, peephole connections, LSTM gradient clipping needs.

46m
63

GRU Architecture

Covers GRU vs LSTM comparison, reset gate function, update gate function, candidate hidden state, GRU equations, GRU parameter efficiency, when to choose GRU vs LSTM.

48m
64

Bidirectional RNNs

Covers forward and backward passes, hidden state concatenation, bidirectional architectures, bidirectionality for classification, limitations for generation, implementing bidirectional RNNs.

52m
65

Stacked RNNs

Covers multiple RNN layers, residual connections for depth, layer normalization in RNNs, depth vs width trade-offs, gradient flow in deep RNNs, practical depth limits.

44m

Part IX: Sequence-to-Sequence

7 chapters

Part X: Self-Attention

6 chapters

Part XI: Positional Encoding

7 chapters

Part XII: Transformer Blocks

8 chapters

Part XIII: Transformer Architectures

6 chapters

Part XIV: Efficient Attention

9 chapters
100

Quadratic Attention Bottleneck

Covers O(n²) memory analysis, O(n²) compute analysis, attention matrix size, practical sequence limits, bottleneck visualization, motivation for efficiency.

29m
101

Sparse Attention Patterns

Covers local attention windows, strided attention patterns, block-sparse attention, combining sparse patterns, sparse attention implementation.

39m
102

Sliding Window Attention

Covers sliding window formulation, window size selection, dilated sliding windows, sliding window for long sequences, Mistral-style windowed attention.

39m
103

Global Tokens

Covers CLS token global attention, learned global tokens, global-local attention mixing, global token count, implementation strategies.

24m
104

Longformer

Covers Longformer attention pattern, global attention configuration, Longformer complexity, Longformer for documents, Longformer implementation.

34m
105

BigBird

Covers BigBird attention pattern, random attention benefits, BigBird theoretical guarantees, BigBird vs Longformer, BigBird applications.

41m
106

Linear Attention

Covers softmax attention reformulation, kernel feature maps, linear complexity attention, linear attention limitations, Performer and variants.

42m
107

FlashAttention Algorithm

Covers GPU memory hierarchy, tiling for SRAM, online softmax computation, recomputation strategy, FlashAttention complexity, FlashAttention benefits.

46m
108

FlashAttention Implementation

Covers CUDA kernel basics, memory access patterns, FlashAttention-2 improvements, using FlashAttention in PyTorch, FlashAttention limitations.

53m

Part XV: Long Context

7 chapters

Part XVI: Pre-training Objectives

7 chapters

Part XVII: BERT and Variants

8 chapters

Part XVIII: GPT Architecture

10 chapters

Part XIX: Modern Decoder Models

7 chapters

Part XX: Encoder-Decoder Models

6 chapters

Part XXI: Scaling Laws

7 chapters

Part XXII: Emergent Capabilities

6 chapters
161

Emergence in Neural Networks

Soon

Covers emergence definition, phase transitions, emergence examples, emergence mechanisms, emergence debate.

162

In-Context Learning Emergence

Soon

Covers ICL emergence curves, ICL vs fine-tuning scaling, ICL mechanism hypotheses, ICL as meta-learning.

163

Chain-of-Thought Emergence

Soon

Covers CoT emergence observations, CoT elicitation, CoT scaling behavior, CoT mechanism theories.

164

Emergence vs Metrics

Soon

Covers discontinuous metrics, accuracy threshold effects, smooth underlying capabilities, re-examining emergence claims.

165

Inverse Scaling

Soon

Covers inverse scaling phenomena, distractor tasks, sycophancy scaling, inverse scaling prize findings.

166

Grokking

Soon

Covers grokking phenomenon, grokking in arithmetic, grokking mechanism theories, grokking phase transitions, practical implications.

Part XXIII: Mixture of Experts

10 chapters
167

Sparse Models

Soon

Covers dense vs sparse trade-offs, conditional computation motivation, sparse model efficiency, sparse model challenges.

168

Expert Networks

Soon

Covers expert architecture, expert as FFN, expert capacity, expert count selection, expert placement in transformer.

169

Gating Networks

Soon

Covers router architecture, routing score computation, router training, router learned behavior.

170

Top-K Routing

Soon

Covers top-1 routing, top-2 routing, k selection trade-offs, routing implementation, combining expert outputs.

171

Load Balancing

Soon

Covers expert utilization imbalance, collapse failure mode, load metrics, balanced routing importance.

172

Auxiliary Balancing Loss

Soon

Covers load balancing loss formulation, loss coefficient tuning, balancing vs task loss, auxiliary loss implementation.

173

Router Z-Loss

Soon

Covers router instability, z-loss formulation, z-loss benefits, z-loss coefficient, combined auxiliary losses.

174

Expert Parallelism

Soon

Covers expert placement strategies, all-to-all communication, communication overhead, expert parallelism implementation.

175

Switch Transformer

Soon

Covers Switch Transformer design, top-1 routing choice, capacity factor, Switch scaling results.

176

Mixtral

Soon

Covers Mixtral architecture, Mixtral expert design, Mixtral performance, Mixtral efficiency, Mixtral vs dense models.

Part XXIV: Fine-tuning Fundamentals

5 chapters
177

Transfer Learning

Soon

Covers transfer learning paradigm, pre-training/fine-tuning split, what transfers, transfer learning efficiency.

178

Full Fine-tuning

Soon

Covers full fine-tuning procedure, fine-tuning hyperparameters, learning rate selection, batch size effects.

179

Catastrophic Forgetting

Soon

Covers forgetting phenomenon, forgetting measurement, forgetting mitigation, pre-trained capability preservation.

180

Fine-tuning Learning Rates

Soon

Covers discriminative fine-tuning, layer-wise learning rates, warmup for fine-tuning, learning rate decay.

181

Fine-tuning Data Efficiency

Soon

Covers few-shot fine-tuning, data augmentation, sample efficiency patterns, small data strategies.

Part XXV: Parameter-Efficient Fine-tuning

12 chapters
182

PEFT Motivation

Soon

Covers parameter storage costs, multi-task deployment, PEFT efficiency, PEFT quality trade-offs.

183

LoRA Concept

Soon

Covers weight update decomposition, low-rank assumption, LoRA efficiency gains, LoRA flexibility.

184

LoRA Mathematics

Soon

Covers LoRA formulation W + BA, rank selection, initialization scheme, LoRA gradient computation.

185

LoRA Implementation

Soon

Covers LoRA module design, merging weights, LoRA training loop, LoRA in PyTorch, HuggingFace PEFT usage.

186

LoRA Hyperparameters

Soon

Covers rank selection guidelines, alpha/rank ratio, which layers to adapt, LoRA dropout.

187

QLoRA

Soon

Covers 4-bit quantization for base model, NF4 data type, double quantization, QLoRA memory savings.

188

AdaLoRA

Soon

Covers importance-based pruning, SVD-based adaptation, dynamic rank, AdaLoRA training procedure.

189

IA3

Soon

Covers IA3 formulation, learned rescaling vectors, IA3 parameter efficiency, IA3 vs LoRA.

190

Prefix Tuning

Soon

Covers prefix tuning formulation, prefix length selection, prefix tuning for generation, prefix vs LoRA.

191

Prompt Tuning

Soon

Covers prompt tuning formulation, prompt initialization, prompt tuning scaling, prompt length effects.

192

Adapter Layers

Soon

Covers adapter architecture, adapter placement, adapter dimensionality, adapter fusion.

193

PEFT Comparison

Soon

Covers performance comparison, parameter efficiency comparison, task suitability, practical recommendations.

Part XXVI: Instruction Tuning

6 chapters
194

Instruction Following

Soon

Covers instruction tuning motivation, instruction format design, instruction diversity, instruction quality.

195

Instruction Data Creation

Soon

Covers human annotation, template-based generation, seed task expansion, quality filtering.

196

Self-Instruct

Soon

Covers self-instruct procedure, instruction generation, response generation, filtering strategies.

197

Instruction Format

Soon

Covers prompt templates, system messages, multi-turn format, chat templates, role definitions.

198

Instruction Tuning Training

Soon

Covers instruction tuning data mixing, training hyperparameters, loss masking, multi-task learning.

199

Instruction Following Evaluation

Soon

Covers instruction following benchmarks, human evaluation, automatic evaluation, instruction difficulty.

Part XXVII: Alignment and RLHF

16 chapters
200

Alignment Problem

Soon

Covers alignment definition, helpfulness vs harmlessness, alignment challenges, alignment approaches overview.

201

Human Preference Data

Soon

Covers preference collection UI, comparison design, annotator guidelines, preference data quality.

202

Bradley-Terry Model

Soon

Covers pairwise comparison model, preference probability, Bradley-Terry likelihood, preference strength.

203

Reward Modeling

Soon

Covers reward model architecture, preference loss function, reward model training, reward model evaluation.

204

Reward Hacking

Soon

Covers reward hacking examples, distribution shift, over-optimization, reward hacking mitigation.

205

Policy Gradient Methods

Soon

Covers policy definition, REINFORCE algorithm, policy gradient derivation, variance reduction.

206

PPO Algorithm

Soon

Covers clipped objective, PPO derivation, trust region intuition, PPO implementation.

207

PPO for Language Models

Soon

Covers LLM as policy, action space (tokens), reward assignment, KL penalty importance.

208

RLHF Pipeline

Soon

Covers SFT stage, reward model training, PPO fine-tuning, RLHF hyperparameters, RLHF debugging.

209

KL Divergence Penalty

Soon

Covers KL penalty motivation, KL coefficient selection, adaptive KL, KL effects on training.

210

DPO Concept

Soon

Covers DPO motivation, removing reward model, DPO intuition, DPO benefits.

211

DPO Derivation

Soon

Covers DPO from RLHF objective, optimal policy derivation, DPO loss function, DPO as classification.

212

DPO Implementation

Soon

Covers DPO data format, DPO loss computation, DPO training procedure, DPO hyperparameters.

213

DPO Variants

Soon

Covers IPO formulation, KTO for unpaired feedback, ORPO, cDPO, comparing alignment methods.

214

RLAIF

Soon

Covers AI as annotator, constitutional AI principles, AI preference generation, RLAIF scalability.

215

Iterative Alignment

Soon

Covers iterative DPO, online preference learning, self-improvement loops, alignment stability.

Part XXVIII: Inference Optimization

14 chapters
216

KV Cache

Soon

Covers KV cache motivation, cache structure, cache memory requirements, cache management.

217

KV Cache Memory

Soon

Covers cache size calculation, batch size effects, sequence length effects, memory bottleneck.
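
The cache-size calculation this upcoming chapter covers is straightforward arithmetic; the configuration below is an assumed Llama-2-7B-style example, not the chapter's own numbers:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes needed to cache K and V for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# assumed config: 32 layers, 32 KV heads of dim 128, 4096-token context,
# batch 1, fp16 (2 bytes) -- that is 512 KiB of cache per token
gib = kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30
```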

218

Paged Attention

Soon

Covers memory fragmentation problem, page-based allocation, vLLM approach, paged attention benefits.

219

KV Cache Compression

Soon

Covers cache eviction strategies, attention sink preservation, H2O algorithm, cache quantization.

220

Weight Quantization Basics

Soon

Covers quantization fundamentals, per-tensor vs per-channel, symmetric vs asymmetric, calibration.

221

INT8 Quantization

Soon

Covers INT8 range mapping, absmax quantization, smooth quantization, INT8 accuracy.

222

INT4 Quantization

Soon

Covers 4-bit challenges, group-wise quantization, 4-bit accuracy trade-offs, 4-bit formats.

223

GPTQ

Soon

Covers GPTQ algorithm, layer-wise quantization, Hessian approximation, GPTQ implementation.

224

AWQ

Soon

Covers salient weight preservation, AWQ algorithm, AWQ vs GPTQ, AWQ benefits.

225

GGUF Format

Soon

Covers GGML/GGUF history, quantization types, GGUF file format, llama.cpp integration.

226

Speculative Decoding

Soon

Covers speculative decoding concept, draft model selection, verification procedure, acceptance rate.

227

Speculative Decoding Math

Soon

Covers acceptance criterion, expected speedup, draft quality effects, optimal draft length.

228

Continuous Batching

Soon

Covers static vs continuous batching, iteration-level scheduling, request completion handling, throughput gains.

229

Inference Serving

Soon

Covers inference server architecture, request routing, load balancing, auto-scaling, latency optimization.

Part XXIX: Retrieval-Augmented Generation

14 chapters
230

RAG Motivation

Soon

Covers knowledge limitations, parametric vs non-parametric, RAG benefits, RAG use cases.

231

RAG Architecture

Soon

Covers retriever component, generator component, retrieval timing, architecture variations.

232

Dense Retrieval

Soon

Covers bi-encoder architecture, embedding similarity, dense vs sparse retrieval, dense retrieval training.

233

Contrastive Learning for Retrieval

Soon

Covers contrastive loss, in-batch negatives, hard negative mining, DPR training procedure.

234

Document Chunking

Soon

Covers chunking strategies, chunk size selection, overlap handling, semantic chunking.

235

Embedding Models

Soon

Covers embedding model architectures, pooling strategies, embedding dimensions, embedding model selection.

236

Vector Similarity Search

Soon

Covers distance metrics, exact vs approximate search, complexity trade-offs, similarity search libraries.

237

HNSW Index

Soon

Covers HNSW algorithm, graph construction, search procedure, HNSW parameters.

238

IVF Index

Soon

Covers clustering approach, probe count, IVF-PQ combination, IVF vs HNSW.

239

Product Quantization

Soon

Covers PQ algorithm, codebook learning, PQ accuracy trade-offs, PQ for scale.

240

Hybrid Search

Soon

Covers BM25 + dense fusion, reciprocal rank fusion, weighted combination, hybrid benefits.

241

Reranking

Soon

Covers cross-encoder architecture, reranking procedure, reranker training, reranker latency.

242

RAG Prompt Engineering

Soon

Covers context placement, citation formats, context truncation, instruction design.

243

RAG Evaluation

Soon

Covers retrieval metrics, generation metrics, end-to-end evaluation, RAGAS framework.

Part XXX: Tool Use and Agents

7 chapters
244

Tool Use Motivation

Soon

Covers LLM limitations, tool augmentation, tool use examples, tool use benefits.

245

Function Calling

Soon

Covers function schema definition, function call generation, function output handling, function calling fine-tuning.

246

ReAct Pattern

Soon

Covers ReAct formulation, thought-action-observation loop, ReAct prompting, ReAct examples.

247

Tool Selection

Soon

Covers tool descriptions, tool routing, multi-tool scenarios, tool selection training.

248

Agent Architectures

Soon

Covers agent loop design, state management, planning strategies, agent termination.

249

Agent Memory

Soon

Covers short-term memory, long-term memory, memory retrieval, memory summarization.

250

Agent Evaluation

Soon

Covers task completion metrics, trajectory evaluation, agent benchmarks, safety evaluation.

Part XXXI: Multimodal Models

8 chapters
251

Vision Transformer

Soon

Covers image patching, patch embeddings, ViT architecture, ViT pre-training.

252

CLIP

Soon

Covers CLIP architecture, CLIP training objective, CLIP zero-shot classification, CLIP embeddings.

253

Vision Encoders for VLMs

Soon

Covers ViT variants for VLMs, SigLIP improvements, image resolution handling, encoder selection.

254

Vision-Language Projection

Soon

Covers linear projection, MLP projection, Q-Former approach, projection training.

255

LLaVA Architecture

Soon

Covers LLaVA design, two-stage training, visual conversation, LLaVA variants.

256

Flamingo Architecture

Soon

Covers cross-attention to images, gated cross-attention, few-shot visual learning, Flamingo training.

257

Multimodal Training Data

Soon

Covers image-text pairs, interleaved documents, visual instruction data, data quality.

258

Multimodal Evaluation

Soon

Covers VQA benchmarks, multimodal understanding benchmarks, multimodal generation evaluation.

Part XXXII: Speech and Audio

5 chapters
259

Speech Representations

Soon

Covers mel spectrograms, mel filterbanks, feature normalization, audio preprocessing.

260

Whisper Architecture

Soon

Covers Whisper encoder-decoder, multitask training, language tokens, timestamp prediction.

261

Whisper Training

Soon

Covers Whisper training data, weak supervision, multilingual training, Whisper capabilities.

262

Speech-Language Integration

Soon

Covers speech encoder + LLM, audio tokens, speech-to-text-to-LLM vs end-to-end, speech LLM architectures.

263

Text-to-Speech

Soon

Covers TTS architecture overview, vocoder role, TTS quality metrics, neural TTS approaches.

Part XXXIII: Evaluation Fundamentals

7 chapters
264

Perplexity Evaluation

Soon

Covers perplexity calculation, perplexity interpretation, perplexity limitations, comparing perplexities.

265

Cross-Entropy Loss

Soon

Covers cross-entropy definition, bits-per-character, cross-entropy vs perplexity, loss curves.

266

BLEU Score

Soon

Covers n-gram precision, brevity penalty, BLEU formula, BLEU limitations, corpus vs sentence BLEU.

267

ROUGE Scores

Soon

Covers ROUGE-N, ROUGE-L, ROUGE-W, ROUGE interpretation, ROUGE limitations.

268

BERTScore

Soon

Covers BERTScore computation, token alignment, BERTScore variants, BERTScore vs BLEU.

269

Exact Match and F1

Soon

Covers exact match scoring, token-level F1, normalization for matching, metric selection.

270

Calibration

Soon

Covers calibration definition, expected calibration error, calibration plots, calibration methods.

Part XXXIV: Benchmark Evaluation

8 chapters
271

MMLU

Soon

Covers MMLU structure, subject coverage, MMLU evaluation protocol, MMLU limitations.

272

HellaSwag

Soon

Covers HellaSwag task design, adversarial filtering, HellaSwag evaluation, HellaSwag saturation.

273

GSM8K

Soon

Covers GSM8K problem types, chain-of-thought evaluation, GSM8K accuracy metrics, math reasoning assessment.

274

HumanEval

Soon

Covers HumanEval structure, functional correctness, pass@k metric, HumanEval limitations.

275

MBPP

Soon

Covers MBPP dataset, MBPP vs HumanEval, code evaluation challenges.

276

TruthfulQA

Soon

Covers TruthfulQA design, truthfulness vs informativeness, TruthfulQA evaluation methods.

277

Benchmark Contamination

Soon

Covers contamination problem, contamination detection methods, n-gram overlap analysis, contamination mitigation.

278

Benchmark Saturation

Soon

Covers ceiling effects, benchmark retirement, dynamic benchmarks, benchmark evolution.

Part XXXV: Human and Model Evaluation

6 chapters
279

Human Evaluation Design

Soon

Covers evaluation interface design, task instructions, annotator selection, evaluation cost.

280

Inter-Annotator Agreement

Soon

Covers Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, handling disagreement.

281

Preference Evaluation

Soon

Covers A/B comparison design, Elo rating systems, preference aggregation, statistical significance.

282

LLM-as-Judge

Soon

Covers judge prompt design, judge model selection, judge calibration, judge limitations.

283

Position Bias in LLM Judges

Soon

Covers position bias measurement, bias mitigation (swapping), verbosity bias, sycophancy.

284

Evaluation Prompt Engineering

Soon

Covers prompt sensitivity, evaluation prompt design, few-shot vs zero-shot evaluation, evaluation consistency.

Part XXXVI: Bias and Fairness

5 chapters
285

Bias in Language Models

Soon

Covers bias sources, bias types (demographic, cultural), bias in training data, bias amplification.

286

Bias Measurement

Soon

Covers embedding association tests, generation bias metrics, classification bias metrics, bias benchmarks.

287

Bias Mitigation

Soon

Covers data balancing, fine-tuning for fairness, prompt-based mitigation, debiasing embeddings.

288

Fairness Metrics

Soon

Covers demographic parity, equalized odds, fairness trade-offs, choosing fairness metrics.

289

Representation Harms

Soon

Covers stereotyping, erasure, demeaning associations, measuring representation harms.

Part XXXVII: Hallucination and Factuality

6 chapters
290

Hallucination Types

Soon

Covers intrinsic vs extrinsic hallucination, factual errors, fabrication, inconsistency.

291

Hallucination Detection

Soon

Covers entailment-based detection, knowledge base verification, self-consistency checks, detection models.

292

Hallucination Causes

Soon

Covers training data issues, exposure bias, knowledge gaps, generation pressure.

293

Hallucination Mitigation

Soon

Covers retrieval augmentation, decoding strategies, training approaches, uncertainty expression.

294

Attribution and Citation

Soon

Covers inline citation, attribution accuracy, source verification, attribution evaluation.

295

Uncertainty Quantification

Soon

Covers confidence calibration, verbalized uncertainty, sampling-based uncertainty, uncertainty communication.

Part XXXVIII: Safety and Security

8 chapters
296

Safety Risks

Soon

Covers harmful content generation, misuse scenarios, unintended harms, safety threat models.

297

Red Teaming

Soon

Covers red team methodology, attack taxonomies, red team findings, red team automation.

298

Jailbreaking

Soon

Covers jailbreak techniques, prompt injection, adversarial suffixes, jailbreak defenses.

299

Prompt Injection

Soon

Covers direct prompt injection, indirect prompt injection, injection in RAG, injection defenses.

300

Content Filtering

Soon

Covers classification-based filtering, rule-based filtering, filter placement, filter evaluation.

301

Guardrails

Soon

Covers input guardrails, output guardrails, guardrail frameworks, guardrail design.

302

Memorization and Privacy

Soon

Covers memorization measurement, extractable memorization, PII in training data, privacy risks.

303

Differential Privacy

Soon

Covers DP-SGD basics, privacy budget, DP accuracy trade-offs, DP for LLMs.

Part XXXIX: Interpretability

11 chapters
304

Interpretability Goals

Soon

Covers debugging, trust, safety, scientific understanding, interpretability approaches overview.

305

Attention Visualization

Soon

Covers attention weight extraction, attention head visualization, attention interpretation caveats, attention tools.

306

Attention Analysis Limitations

Soon

Covers attention vs importance, attention manipulation studies, gradient-based alternatives.

307

Probing Classifiers

Soon

Covers linear probing methodology, probing task design, probing interpretation, control tasks.

308

Probing Layers

Soon

Covers layer selection, representation evolution, task localization, layer probing patterns.

309

Activation Patching

Soon

Covers patching methodology, locating information, patching experiments, causal tracing.

310

Logit Lens

Soon

Covers logit lens concept, intermediate vocabulary projection, tuned lens, lens interpretation.

311

Sparse Autoencoders

Soon

Covers SAE architecture, sparsity constraints, dictionary learning, SAE for LLMs.

312

Feature Interpretation

Soon

Covers feature activation patterns, feature naming, automated interpretation, feature circuits.

313

Mechanistic Interpretability

Soon

Covers circuit analysis, algorithmic tasks, induction heads, mechanistic discoveries.

314

Activation Steering

Soon

Covers steering vectors, activation addition, representation engineering, steering applications.

Part XL: Data Curation

10 chapters
315

Web Crawling

Soon

Covers Common Crawl, crawling strategies, robots.txt respect, crawl freshness.

316

Document Extraction

Soon

Covers HTML parsing, boilerplate removal, content extraction, trafilatura and similar tools.

317

Language Identification

Soon

Covers language ID models, multilingual document handling, code-switching, language filtering.

318

Deduplication

Soon

Covers exact deduplication, near-duplicate detection, document vs substring dedup, dedup at scale.

319

MinHash

Soon

Covers MinHash algorithm, Jaccard similarity estimation, MinHash LSH, MinHash implementation.

320

Quality Filtering

Soon

Covers heuristic filters, perplexity filtering, classifier-based filtering, filter thresholds.

321

Toxicity Filtering

Soon

Covers toxicity classifiers, toxicity thresholds, over-filtering risks, toxicity filter evaluation.

322

PII Removal

Soon

Covers PII detection methods, PII removal strategies, PII removal evaluation, privacy preservation.

323

Data Mixing

Soon

Covers domain proportions, quality weighting, data mixing experiments, optimal mixing.

324

Synthetic Data

Soon

Covers synthetic data generation, quality verification, synthetic data diversity, distillation.

Part XLI: Training Infrastructure

11 chapters
325

GPU Architecture

Soon

Covers GPU memory hierarchy, CUDA cores, tensor cores, GPU specifications.

326

Memory Management

Soon

Covers memory breakdown (activations, parameters, gradients, optimizer states), memory estimation, OOM debugging.

327

Data Parallelism

Soon

Covers DDP algorithm, gradient synchronization, all-reduce operations, DDP scaling.

328

Tensor Parallelism

Soon

Covers column parallelism, row parallelism, communication patterns, Megatron-style parallelism.

329

Pipeline Parallelism

Soon

Covers pipeline stages, micro-batching, pipeline bubbles, pipeline schedules (GPipe, 1F1B).

330

ZeRO Optimization

Soon

Covers ZeRO stage 1 (optimizer state partitioning), ZeRO stage 2 (gradient partitioning), ZeRO stage 3 (parameter partitioning), ZeRO memory savings.

331

FSDP

Soon

Covers FSDP concepts, FSDP vs ZeRO, FSDP sharding strategies, FSDP usage.

332

Activation Checkpointing

Soon

Covers checkpointing concept, checkpoint selection, checkpointing overhead, selective checkpointing.

333

Mixed Precision Training

Soon

Covers floating point formats, loss scaling, BF16 advantages, mixed precision implementation.

334

Communication Optimization

Soon

Covers gradient compression, communication overlap, topology-aware communication, NCCL optimization.

335

Checkpointing and Recovery

Soon

Covers checkpoint contents, checkpoint frequency, async checkpointing, fault recovery.

Part XLII: Training Optimization

8 chapters
336

Learning Rate Warmup

Soon

Covers warmup motivation, linear warmup, warmup duration, warmup for large batches.

337

Learning Rate Decay

Soon

Covers step decay, exponential decay, inverse square root decay, decay scheduling.

338

Cosine Learning Rate Schedule

Soon

Covers cosine decay formula, cosine with restarts, cosine schedule parameters, cosine vs linear.

339

Large Batch Training

Soon

Covers batch size effects, learning rate scaling, batch size limits, LAMB optimizer.

340

Weight Decay

Soon

Covers weight decay formula, decoupled weight decay, weight decay selection, weight decay interaction with Adam.

341

Gradient Accumulation

Soon

Covers accumulation procedure, accumulation steps, accumulation for memory, accumulation correctness.

342

Training Stability

Soon

Covers loss spikes, gradient norm monitoring, stability techniques, training stability debugging.

343

Hyperparameter Selection

Soon

Covers hyperparameter search, hyperparameter transfer, critical vs robust hyperparameters, default recipes.

Part XLIII: Code Generation

6 chapters
344

Code LLM Training

Soon

Covers code training data, code tokenization, fill-in-the-middle training, code pre-training objectives.

345

Code Understanding

Soon

Covers code explanation, bug detection, code review, code search.

346

Code Completion

Soon

Covers completion context, completion ranking, completion latency, completion UX.

347

Code Generation

Soon

Covers docstring-to-code, test-to-code, code generation strategies, generation quality.

348

Code Execution

Soon

Covers sandboxed execution, execution feedback, iterative refinement, execution safety.

349

Code Evaluation

Soon

Covers functional correctness, pass@k metric, code benchmarks, beyond correctness.

Part XLIV: Production Systems

9 chapters
350

Model Serving

Soon

Covers serving frameworks, model loading, request handling, serving configuration.

351

Latency Optimization

Soon

Covers latency breakdown, batching latency, streaming responses, latency monitoring.

352

Throughput Optimization

Soon

Covers batch size tuning, GPU utilization, concurrent requests, throughput measurement.

353

Auto-scaling

Soon

Covers scaling metrics, horizontal scaling, scale-up vs scale-out, scaling policies.

354

Model Routing

Soon

Covers model selection, A/B testing, model cascades, routing strategies.

355

Caching

Soon

Covers prompt caching, semantic caching, cache invalidation, cache hit rates.

356

Monitoring

Soon

Covers metrics collection, alerting, logging, dashboards.

357

Quality Monitoring

Soon

Covers output quality metrics, drift detection, regression detection, quality alerts.

358

Cost Management

Soon

Covers cost modeling, cost optimization, cost allocation, cost monitoring.

Part XLV: Continual Learning

5 chapters
359

Continual Learning Problem

Soon

Covers continual learning definition, catastrophic forgetting, continual learning scenarios.

360

Regularization Methods

Soon

Covers elastic weight consolidation, synaptic intelligence, parameter importance, regularization trade-offs.

361

Replay Methods

Soon

Covers replay buffer design, pseudo-rehearsal, generative replay, replay selection.

362

Architecture Methods

Soon

Covers progressive networks, expert expansion, architecture search, modular approaches.

363

Continual Learning Evaluation

Soon

Covers forward transfer, backward transfer, evaluation protocols, continual benchmarks.

Part XLVI: Model Compression

6 chapters
364

Knowledge Distillation

Soon

Covers distillation objective, temperature in distillation, teacher selection, distillation for LLMs.

365

Distillation Variants

Soon

Covers feature distillation, attention transfer, progressive distillation, on-policy distillation.

366

Pruning Basics

Soon

Covers weight pruning, structured vs unstructured, pruning criteria, pruning schedule.

367

Structured Pruning

Soon

Covers head pruning, layer pruning, width pruning, structured pruning implementation.

368

Model Merging

Soon

Covers weight averaging, task arithmetic, TIES merging, DARE merging.

369

Model Merging Applications

Soon

Covers multi-task merging, style merging, capability composition, merging evaluation.

Part XLVII: Advanced Topics

11 chapters
370

Constitutional AI

Soon

Covers constitutional principles, critique and revision, CAI training, CAI effectiveness.

371

Process Reward Models

Soon

Covers outcome vs process reward, PRM training, PRM for math, PRM limitations.

372

Test-Time Compute

Soon

Covers multiple sampling, self-consistency, iterative refinement, compute-optimal inference.

373

Chain-of-Thought

Soon

Covers CoT prompting, zero-shot CoT, CoT fine-tuning, CoT limitations.

374

Self-Consistency

Soon

Covers self-consistency procedure, sampling diversity, voting strategies, self-consistency effectiveness.

375

Tree of Thought

Soon

Covers ToT framework, thought generation, thought evaluation, ToT search.

376

Retrieval-Augmented Training

Soon

Covers RETRO architecture, retrieval during training, retrieved context integration.

377

Long-Form Generation

Soon

Covers outline-based generation, hierarchical generation, coherence maintenance, long-form evaluation.

378

Watermarking

Soon

Covers watermarking schemes, statistical detection, watermark robustness, watermark evaluation.

379

Model Cards

Soon

Covers model card contents, intended use documentation, limitation documentation, model card best practices.

380

Responsible Deployment

Soon

Covers release decisions, staged release, access control, deployment monitoring.

In Progress

This comprehensive handbook is currently in development. Each chapter will be published as it's completed, with practical examples, code implementations, and real-world applications.

Reference

BibTeX
@book{languageaihandbook,
  author    = {Michael Brenndoerfer},
  title     = {Language AI Handbook},
  year      = {2025},
  url       = {https://mbrenndoerfer.com/books/language-ai-handbook},
  publisher = {mbrenndoerfer.com},
  note      = {Accessed: 2025-12-26}
}
