A

activation function

An activation function introduces non-linearity into neural networks, allowing them to learn complex patterns. Common examples include sigmoid, tanh, and ReLU functions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)
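
A minimal sketch of the three activation functions named above, written in NumPy; the function names and test values are illustrative.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real value into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Keeps positive values, zeroes out negative ones
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```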

activation functions

An activation function introduces non-linearity into neural networks, allowing them to learn complex patterns. Common examples include sigmoid, tanh, and ReLU functions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation

ADALINE units

An adaptive linear element (ADALINE) is a single neuron that can learn to classify patterns by adjusting its weights using the LMS algorithm.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

Alignment techniques

Alignment techniques are methods for finding correspondences between different representations, such as words in different languages.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

Ambiguity

Ambiguity in language occurs when a sentence or phrase can be interpreted in multiple ways. This is a major challenge for rule-based systems that need to choose the correct interpretation.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

artificial intelligence

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It encompasses various technologies including machine learning, natural language processing, and robotics.

Introduction → Rule-Based NLP: From Turing to Templates → 1950: The Turing Test

attention mechanisms

Attention mechanisms allow neural networks to focus on different parts of the input sequence when making predictions. They compute a weighted sum of input features, where the weights are learned and indicate the importance of each input element.

Introduction → A Quick Glance Through the History of Language AI
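
A minimal sketch of the weighted-sum idea described above, using simple dot-product attention over a toy input; the shapes and names are illustrative rather than any particular model's API.

```python
import numpy as np

def dot_product_attention(query, keys, values):
    scores = keys @ query                    # relevance of each position to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: importance weights that sum to 1
    return weights @ values, weights         # weighted sum of the value vectors

keys = values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
context, weights = dot_product_attention(query, keys, values)
print(weights, context)
```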

B

back-off method

Back-off is a technique that uses shorter n-grams when longer ones have insufficient data, allowing models to handle unseen word combinations gracefully.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off
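
A toy sketch of the back-off idea with made-up counts; full Katz back-off additionally applies Good-Turing discounting and normalized back-off weights, which are omitted here.

```python
# Illustrative counts only
trigram_counts = {("the", "cat", "sat"): 2}
bigram_counts  = {("the", "cat"): 7, ("cat", "sat"): 5}
unigram_counts = {"the": 50, "cat": 9, "sat": 20}
total_words = 1000

def backoff_prob(w1, w2, w3):
    # Use the trigram estimate when the trigram has been seen ...
    if (w1, w2, w3) in trigram_counts and (w1, w2) in bigram_counts:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    # ... otherwise back off to the bigram ...
    if (w2, w3) in bigram_counts:
        return bigram_counts[(w2, w3)] / unigram_counts[w2]
    # ... otherwise fall back to the unigram estimate.
    return unigram_counts.get(w3, 0) / total_words

print(backoff_prob("the", "cat", "sat"))  # trigram evidence available
print(backoff_prob("a", "dog", "sat"))    # unseen context: backs off to the unigram
```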

backpropagation

Backpropagation is an algorithm for efficiently training multi-layer neural networks by computing gradients of the error with respect to each weight.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)
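
A minimal sketch of backpropagation for a one-hidden-unit network with a squared-error loss; the data, weights, and learning rate are illustrative.

```python
import numpy as np

x, y = 0.5, 1.0        # a single training example
w1, w2 = 0.8, -0.3     # initial weights: x -> h = tanh(w1*x) -> prediction = w2*h
lr = 0.1

for step in range(100):
    # Forward pass
    h = np.tanh(w1 * x)
    prediction = w2 * h
    loss = 0.5 * (prediction - y) ** 2

    # Backward pass: apply the chain rule layer by layer
    d_prediction = prediction - y
    d_w2 = d_prediction * h
    d_h = d_prediction * w2
    d_w1 = d_h * (1 - h ** 2) * x    # derivative of tanh(u) is 1 - tanh(u)^2

    # Gradient-descent updates
    w2 -= lr * d_w2
    w1 -= lr * d_w1

print(round(loss, 6))   # the error shrinks as training proceeds
```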

bias term

A bias term is a constant value added to the weighted sum of inputs in a neural network, allowing the neuron to shift its activation function and learn more flexible decision boundaries.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)

Bilingual Evaluation Understudy

Bilingual Evaluation Understudy — an n-gram precision-based automatic MT evaluation metric.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2002: BLEU Metric

BLEU

BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric for machine translation that measures how similar a machine translation is to reference human translations.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation
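
A toy sketch of the two ingredients behind BLEU, modified n-gram precision and a brevity penalty, for a single sentence pair; real BLEU combines precisions for n = 1 through 4 over a whole corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    return matches / max(1, sum(cand.values()))

candidate = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
# Brevity penalty: penalize candidates shorter than the reference
bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
bleu_like = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(round(bleu_like, 3))
```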

blocks world

A blocks world is a simplified artificial environment containing geometric objects like blocks, pyramids, and boxes that can be manipulated. It's used in AI research to study language understanding in a controlled, manageable domain.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

brevity penalty

A multiplicative penalty in BLEU that discourages candidate translations that are shorter than the reference.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2002: BLEU Metric

brittle

Brittle systems fail completely when encountering unexpected inputs, while robust systems can handle edge cases and continue functioning.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

buying entails paying

An example of entailment in WordNet: one action (buying) necessarily implies another (paying).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0

C

car ↔ automobile

An example of synonymy: words that can be used interchangeably in certain contexts.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0

car is a holonym of wheel

An example of holonymy: a whole (the car) that contains smaller parts (such as the wheel).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0

car is a hyponym of vehicle

An example of hyponymy: a more specific term (car) that falls under a more general category (vehicle).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0

cell state

A dedicated memory pathway that carries information across time steps with minimal modification.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

cell state ($C_t$)

In the standard LSTM cell diagram, the horizontal pathway running along the top of the cell that carries information across time steps with minimal modification.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

chain rule

The chain rule is a fundamental theorem in calculus that allows us to compute the derivative of a composite function by multiplying the derivatives of its component functions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation
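
As a one-line worked example: if $y = f(g(x))$ then $\frac{dy}{dx} = f'(g(x))\,g'(x)$; for instance, with $y = (3x + 1)^2$ the derivative is $2(3x + 1) \cdot 3 = 6(3x + 1)$.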

ChatGPT's conversational abilities

ChatGPT is a conversational AI model that can engage in natural language dialogue. It was trained using Reinforcement Learning from Human Feedback (RLHF) to make it more helpful, honest, and harmless.

Introduction → A Quick Glance Through the History of Language AI

chunking

Identifying groups of words that function as a single unit, like noun phrases or verb phrases.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

conditional likelihood

The probability of the correct output sequence given the input sequence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Conditional modeling

Modeling the probability of outputs given inputs, rather than modeling the joint probability of inputs and outputs together.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

conditional probability

The probability of an event occurring given that another event has already occurred.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
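
As a small worked example with made-up numbers: if $P(\text{clouds}) = 0.4$ and $P(\text{rain}, \text{clouds}) = 0.1$, then $P(\text{rain} \mid \text{clouds}) = \frac{P(\text{rain}, \text{clouds})}{P(\text{clouds})} = \frac{0.1}{0.4} = 0.25$.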

Consistency

Consistency refers to how uniform the output is across different inputs. Statistical MT produced more uniform translations across different types of text.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

Context features

Features that consider the broader context beyond just the current and adjacent elements.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Context-Free Grammars (CFGs)

Context-Free Grammars are formal grammars where each production rule has a single non-terminal symbol on the left side. They are 'context-free' because the rule can be applied regardless of the surrounding context.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

continuous

Values that can take on any value within a range (like sound waves, temperatures, or distances).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Convolutional layers

Layers that apply a filter (kernel) to input data, sliding it across the data to detect local patterns.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)
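
A minimal sketch of the sliding-filter operation in one dimension; the input signal and kernel values are illustrative.

```python
import numpy as np

def conv1d(signal, kernel):
    k = len(kernel)
    # Slide the kernel across the signal, taking a dot product at each position
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])    # a simple difference (edge-like) filter
print(conv1d(signal, kernel))           # [-2. -2. -2.]
```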

convolutional networks

Convolutional neural networks use multiple layers of feature detectors to build hierarchical representations of input data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

convolutional neural networks

Convolutional Neural Networks (CNNs) are neural networks that use convolutional layers to process data with grid-like topology, such as images or time series data, by applying the same filter across different positions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)

credit assignment problem

The credit assignment problem asks how to determine which parts of a system are responsible for errors or successes, particularly in systems with many interconnected components like neural networks.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation

CRFs

A discriminative probabilistic model that defines P(labels|inputs) for sequences or graphs.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
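
A toy sketch of the linear-chain CRF idea: score a label sequence with weighted feature functions, then normalize over all possible label sequences. The words, labels, features, and weights are invented for illustration, and the normalizer is computed by brute force rather than with the dynamic programming a real CRF implementation would use.

```python
import math
from itertools import product

LABELS = ["DT", "NN", "VB"]
words = ["the", "cat", "sat"]

# State features: how well a label fits a word (toy weights)
state_weights = {("the", "DT"): 2.0, ("cat", "NN"): 2.0, ("sat", "VB"): 2.0}
# Transition features: how well adjacent labels fit together (toy weights)
transition_weights = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}

def score(words, labels):
    total = sum(state_weights.get((w, y), 0.0) for w, y in zip(words, labels))
    total += sum(transition_weights.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return total

# Normalization factor Z(x): sum over every possible label sequence
Z = sum(math.exp(score(words, labels)) for labels in product(LABELS, repeat=len(words)))

labels = ("DT", "NN", "VB")
print(math.exp(score(words, labels)) / Z)   # P(labels | words) under this toy model
```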

curse of dimensionality

In machine learning, the curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. In language modeling, as n-gram length increases, the number of possible word combinations grows exponentially, making the data increasingly sparse.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

D

Data-driven learning

Data-driven learning refers to approaches that learn patterns from data rather than relying on hand-written rules or expert knowledge.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Decoding

Decoding in HMMs finds the most likely sequence of hidden states that could have generated the observed outputs. This is the core of many applications like speech recognition.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

deep neural networks

Deep neural networks contain multiple hidden layers that learn hierarchical representations, with each layer building upon features learned by previous layers.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

dependencies between adjacent elements

Relationships between elements that are next to each other in the sequence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Dependency Grammar

Dependency Grammar focuses on the relationships between words rather than phrase structure. Each word (except the root) depends on exactly one other word, creating a tree of dependencies.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

determiner

Words like 'the', 'a', 'an' that introduce and specify nouns.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Determiner, Noun, Verb, Preposition, Determiner, Noun

DT = Determiner, NN = Noun, VB = Verb, IN = Preposition: standard abbreviations for grammatical categories.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

discount factor

A discount factor reduces the probability of observed events to reserve probability mass for unseen events, preventing overconfidence in sparse data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

Discrete

Values that can only take on specific, separate values (like words, categories, or whole numbers).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

dynamic programming

Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It's used in HMM algorithms like the Viterbi algorithm.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models
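
A compact sketch of the Viterbi algorithm, a dynamic-programming decoder for HMMs, on a toy two-state model; all probabilities are invented for illustration.

```python
states = ["Noun", "Verb"]
start = {"Noun": 0.6, "Verb": 0.4}
trans = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.6, "Verb": 0.4}}
emit = {"Noun": {"dogs": 0.5, "run": 0.1}, "Verb": {"dogs": 0.1, "run": 0.6}}

def viterbi(observations):
    # best[t][s]: probability of the best state path ending in state s at time t
    best = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, best[t - 1][r] * trans[r][s]) for r in states),
                          key=lambda item: item[1])
            best[t][s] = p * emit[s][observations[t]]
            back[t][s] = prev
    # Trace back the most likely state sequence
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["dogs", "run"]))   # ['Noun', 'Verb']
```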

E

edge cases

Edge cases are unusual or extreme situations that test the limits of a system's capabilities, requiring robust handling to prevent failures.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

embodied AI

Embodied AI refers to artificial intelligence systems that interact with the physical world through sensors and actuators, rather than being purely computational. It emphasizes the importance of physical interaction for intelligence.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

Emission Probabilities

Emission probabilities describe how likely each observation is given a hidden state. They capture the relationship between hidden states and observable outputs.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

End-to-end learning

End-to-end learning refers to training systems to perform a complete task directly, without requiring intermediate steps like explicit alignment.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

error-driven learning approach

An approach in which weights are adjusted in proportion to prediction errors; the LMS algorithm established this principle.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

Evaluation

Evaluation in HMMs determines the probability of observing a sequence of outputs given a model. This is used to score how well different models explain the observed data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Evaluation metrics

Evaluation metrics are objective measures used to assess the quality of machine learning systems, such as BLEU for translation quality.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

expectation-maximization

Expectation-maximization is an iterative algorithm for finding maximum likelihood estimates of parameters in statistical models with hidden variables.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

F

factor-graph

A bipartite graphical representation of a probabilistic model in which some nodes represent variables, others represent factors (local functions), and edges connect each factor to the variables it depends on.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Feature engineering

Feature engineering is the process of manually creating features from raw data that are relevant for machine learning models. HMMs required extensive hand-crafting of features.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

feature functions

Functions that fire (often 0/1) for particular input/label configurations, e.g., word shape with a tag.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Feature learning

The ability of neural networks to automatically discover useful features from raw data without manual engineering.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

feedforward neural networks

Feedforward neural networks are neural networks where connections between nodes do not form cycles, meaning information flows in one direction from input to output without any feedback loops.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)

forget gate

Controls which parts of the previous cell state are erased.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

formal grammar rules

Formal grammar rules are explicit, mathematical descriptions of how sentences are structured. They define the syntax and relationships between words in a language.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

Formal grammars

Formal grammars are mathematical systems for describing the structure of languages. They provide precise rules for generating valid sentences in a language.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

frame problem

The frame problem is the challenge of determining which aspects of a situation are relevant when reasoning about actions and their consequences. It's a fundamental problem in AI that makes complete symbolic representation intractable in complex domains.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

G

global context

The broader context or overall structure that affects individual decisions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Global features

Features that capture properties of the entire sequence that affect local decisions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Global optimization

Finding the best solution by considering the entire sequence at once, rather than making decisions one element at a time.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Good-Turing estimator

The Good-Turing estimator is a statistical technique that estimates the probability of unseen events based on the frequency distribution of seen events. It was originally developed during WWII to help crack the Enigma code.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off
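
A small sketch of the core Good-Turing adjustment, which replaces an observed count $r$ with $r^* = (r + 1) N_{r+1} / N_r$, where $N_r$ is the number of distinct n-grams seen exactly $r$ times; the counts below are invented for illustration.

```python
# N_r: how many distinct n-grams were seen exactly r times (toy numbers)
freq_of_freqs = {1: 100, 2: 40, 3: 20, 4: 10}
total_ngrams = sum(r * n_r for r, n_r in freq_of_freqs.items())

def good_turing_count(r):
    # Adjusted count r* = (r + 1) * N_{r+1} / N_r
    return (r + 1) * freq_of_freqs.get(r + 1, 0) / freq_of_freqs[r]

print(good_turing_count(1))              # singleton counts are discounted below 1
print(freq_of_freqs[1] / total_ngrams)   # probability mass reserved for unseen events
```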

GPT-3's few-shot learning

Few-shot learning is the ability of a model to learn new tasks from just a few examples, without extensive retraining. GPT-3 demonstrated this capability by performing various tasks after seeing just a few examples.

Introduction → A Quick Glance Through the History of Language AI

GPT-4's multimodal capabilities

Multimodal AI can process and understand multiple types of data simultaneously, such as text, images, audio, and video. GPT-4 can analyze images and text together to answer questions about visual content.

Introduction → A Quick Glance Through the History of Language AI

gradient descent

Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function to find the minimum.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation
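
A minimal sketch of gradient descent on a one-dimensional loss; the function, starting point, and learning rate are illustrative.

```python
# Minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient   # step in the direction of steepest descent

print(round(w, 4))   # converges toward the minimum at w = 3
```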

gradient-based learning

Gradient-based learning adjusts model parameters by following the gradient of the loss function, moving toward optimal solutions through iterative updates.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

gradients

Gradients are vectors of partial derivatives that indicate how much a function changes with respect to each of its input variables. In neural networks, they show how the loss changes with respect to each weight.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation

grammatical reflection

Grammatical reflection is a technique where a statement is transformed into a question by changing its grammatical structure. For example, 'I am sad' becomes 'Why do you think you are sad?'

Introduction → Rule-Based NLP: From Turing to Templates → 1966: ELIZA

greedy local decisions

Making decisions one at a time based only on local information, without considering the full context.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

GRUs

Gated Recurrent Units, a simplified version of LSTM with fewer parameters but similar performance.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)

H

hand-crafted features

Features that are manually designed by domain experts rather than learned automatically from data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Handling dependencies

Explicitly modeling the relationships between different elements in the sequence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Harris corners

A method for detecting corners and interest points in images.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)

Hidden Markov Models

A statistical model that describes a system with hidden states that influence observable outputs. The system follows the Markov property, meaning the current state depends only on the previous state.

Introduction → A Quick Glance Through the History of Language AI

Hidden Markov Models (HMMs)

A statistical model that describes a system with hidden states that influence observable outputs. The system follows the Markov property, meaning the current state depends only on the previous state.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

hidden state

An internal representation that carries information from previous time steps, acting as the network's memory.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)
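
A common form of this recurrence, as a sketch (exact parameterizations vary by architecture), is $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$: the new hidden state $h_t$ mixes the previous hidden state $h_{t-1}$ with the current input $x_t$.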

Hidden state inference

Hidden state inference is the process of determining the most likely hidden states given observed outputs, which is a key capability of HMMs.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

hidden states

Hidden states are the underlying system states that cannot be directly observed but influence the observable outputs. In HMMs, these states follow a Markov process.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

hierarchical approach

A hierarchical approach processes data at multiple levels, starting with the most specific (longest n-grams) and falling back to more general (shorter n-grams) when needed.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

hot ↔ cold

An example of antonymy: words with contrasting or opposite meanings.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0

Hybrid systems

Hybrid systems combine multiple approaches, such as symbolic rules with statistical learning, to leverage the strengths of each method.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

I

IBM Model 1

IBM Model 1 was the simplest IBM translation model, using only word-to-word translation probabilities and treating all alignments as equally likely. While limited, it established the basic statistical framework for machine translation.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

IBM Model 2

IBM Model 2 introduced alignment probabilities to handle the fact that word order differs between languages, making it more realistic than Model 1.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

IBM Model 3

IBM Model 3 added fertility modeling to handle cases where one source word translates to multiple target words, addressing a key limitation of earlier models.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

IBM Model 4

IBM Model 4 introduced distortion modeling to capture how word positions change during translation, making it more accurate for languages with different word orders.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

IBM Model 5

IBM Model 5 refined the distortion modeling and made the training process more stable, representing the most sophisticated of the IBM models.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

independent predictions

Making predictions for each element independently, without considering relationships between elements.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

information

In information theory, 'information' refers to the amount of uncertainty or surprise in a message. The more unpredictable something is, the more information it carries. This is measured in bits.

Introduction → Rule-Based NLP: From Turing to Templates → 1948: Shannon's N-gram Model

information extraction

Extracting structured information (like names, dates, relationships) from unstructured text.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

information theory

Information theory is a branch of mathematics that studies the quantification, storage, and communication of information. It was developed by Claude Shannon and provides the mathematical foundation for data compression and error correction.

Introduction → Rule-Based NLP: From Turing to Templates → 1948: Shannon's N-gram Model

input gate

Controls which new information enters the cell state.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

interdependent

When the prediction of one element affects or is affected by the predictions of other elements in the sequence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

interpolation

Interpolation refers to combining multiple probability estimates, often using a weighted average to create a more robust final estimate.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

J

joint probability

The probability of two or more events occurring together.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

K

Katz back-off

Katz back-off is a smoothing technique that handles unseen word sequences by 'backing off' to shorter n-grams when longer ones aren't available in the training data.

Introduction → Rule-Based NLP: From Turing to Templates → 1948: Shannon's N-gram Model

kernel size

Kernel size refers to the number of time steps or spatial positions that a filter or convolution operation covers in a single computation.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)

L

Language modeling

Language modeling involves predicting the probability of word sequences, forming the foundation for many NLP applications.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

language models

AI models designed to understand, generate, and work with human language.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

learned weights

Parameters that determine the importance of each feature function in the model.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Learning

Learning in HMMs involves estimating the model parameters (transition and emission probabilities) from training data using algorithms like the Baum-Welch algorithm.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

learning rate

The learning rate controls the size of the steps the perceptron takes when updating its weights. If the rate is too large, learning becomes unstable; if it is too small, learning becomes very slow.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

Limited context

Limited context refers to the constraint that HMMs can only look back one step due to the Markov assumption, missing longer-range dependencies in language.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Limited expressiveness

A restriction on a model's ability to capture complex patterns and relationships in the data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

linear classifier

A linear classifier separates data into categories using a straight line (in 2D) or hyperplane (in higher dimensions).

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

Linear modeling

Models that assume linear relationships between input features and output probabilities.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

linearly separable problems

Linearly separable data can be perfectly separated into classes using a straight line (2D) or hyperplane (higher dimensions). If data points cannot be separated this way, they are linearly inseparable.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

LMS algorithm

The LMS algorithm provides a stable method for minimizing squared errors through iterative weight adjustments and became fundamental in adaptive signal processing.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

LMS weight update

The LMS (Least Mean Squares) algorithm minimizes the squared error between desired and actual outputs by adjusting weights proportionally to the error.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)
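
A minimal sketch of the LMS update on a toy one-dimensional problem; the data and learning rate are illustrative.

```python
import numpy as np

xs = np.array([0.5, 1.0, 1.5, 2.0])
ys = 2.0 * xs                      # the target relationship is y = 2x (illustrative)

w, lr = 0.0, 0.1
for epoch in range(50):
    for x, y in zip(xs, ys):
        error = y - w * x          # desired output minus actual output
        w += lr * error * x        # adjust the weight in proportion to the error

print(round(w, 3))                 # approaches 2.0
```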

Local decisions

Local decisions refer to making translation choices independently for each word or phrase, missing the global coherence of the entire sentence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

local minima

Local minima are points where a function has its lowest value in a neighborhood, but not necessarily the globally lowest value.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

Local optima

Local optima are solutions that are optimal within a small region but not globally optimal. Training algorithms can get stuck in these suboptimal solutions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

loss landscape

Loss landscape refers to the multi-dimensional surface formed by plotting the loss function against all possible weight combinations.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation

LSTMs

Long Short-Term Memory networks, a type of RNN that addressed the vanishing gradient problem through gated mechanisms.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)

M

Machine translation

Machine translation automatically translates text from one language to another, requiring models that can handle unseen patterns in the source language.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

MADALINE

MADALINE (Multiple ADAptive LINear Elements) is an early neural network architecture that uses multiple adaptive linear elements connected in a specific pattern to solve classification problems.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)

Maintainability

Maintainability refers to how easy it is to update and improve a system. Statistical MT required less linguistic expertise to develop and maintain.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

Markov

A property of a system where the current state depends only on the previous state, not the entire history. This simplifying assumption makes mathematical modeling tractable.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Markov assumption

The Markov assumption states that the current state depends only on the previous state, not the entire history. In n-gram models, this means predicting the next word based only on the last few words, not the entire sentence.

Introduction → Rule-Based NLP: From Turing to Templates → 1948: Shannon's N-gram Model
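
For a bigram model, for example, the assumption can be written as $P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$, so 'the cat sat' is scored as $P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{cat})$.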

maximum entropy models

A type of probabilistic model that maximizes entropy subject to constraints, often used for classification.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

microworlds

Microworlds are simplified, artificial environments used in AI research to study complex problems in isolation. They allow researchers to focus on specific aspects of intelligence without dealing with the full complexity of the real world.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

N

n-gram models

N-grams are contiguous sequences of n words from a text. For example, in 'the cat sat', the bigrams (2-grams) are 'the cat' and 'cat sat', while the trigrams (3-grams) would be 'the cat sat'.

Introduction → A Quick Glance Through the History of Language AI
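
A small sketch of extracting the n-grams from the example above; the function and variable names are illustrative.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat".split()
print(ngrams(tokens, 2))   # [('the', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))   # [('the', 'cat', 'sat')]
```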

n-grams

N-grams are contiguous sequences of n words from a text. For example, in 'the cat sat', the bigrams (2-grams) are 'the cat' and 'cat sat', while the trigrams (3-grams) would be 'the cat sat'.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

named entities

Real-world objects referred to by name in text, such as people, organizations, locations, and dates.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

named entity recognition

Identifying and classifying named entities like people, organizations, locations, dates, etc. in text.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

natural language processing

Computational techniques for understanding, interpreting, and generating human language.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Neural CRFs

Models that combine the structured prediction capabilities of CRFs with the feature learning abilities of neural networks.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Neural machine translation

Neural machine translation uses neural networks to learn continuous representations of words and sentences, enabling more sophisticated translation models.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

neural methods

Machine learning methods based on neural networks that became dominant in the 2010s and beyond.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

neural network

Neural networks are computational models inspired by biological neural networks. They consist of interconnected nodes (neurons) that process information and can learn patterns from data.

Introduction → Rule-Based NLP: From Turing to Templates → 1948: Shannon's N-gram Model

Noam Chomsky

Noam Chomsky (1928-) is an American linguist, philosopher, and cognitive scientist who revolutionized the study of language with his theory of generative grammar and the concept of universal grammar.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

noisy channel model

The noisy channel model assumes that the source sentence is a 'corrupted' or 'noisy' version of some original target sentence, and translation is the process of recovering the original clean signal.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation
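
In the usual Bayes-rule formulation of this idea, the best translation $\hat{e}$ of an observed foreign sentence $f$ is $\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e P(e)\,P(f \mid e)$, where $P(e)$ is a language model over fluent target sentences and $P(f \mid e)$ is the translation (channel) model.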

normalization factor

The partition function ensuring probabilities sum to 1 across all possible output sequences.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

noun

Words that name people, places, things, or ideas.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

O

optical character recognition (OCR)

Optical character recognition converts images of text into machine-readable text, enabling computers to process scanned documents.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

optimal tag sequence

The sequence of labels that maximizes the overall probability given the input.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

output gate

Controls how much of the cell state becomes visible in the hidden state at the current time step.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

overfitting

When a model fits the training data too closely, including its noise, and fails to generalize to new data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)

P

pairwise dependencies

Dependencies between pairs of adjacent elements, rather than more complex multi-element relationships.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

parallel text data

Parallel text data consists of pairs of sentences in different languages that express the same meaning. This data is essential for training statistical translation models.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

parser

A parser is a program that analyzes the grammatical structure of sentences and converts them into a form that can be understood and processed by a computer.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

Parsing

Parsing is the process of analyzing a string of symbols according to a formal grammar to determine its grammatical structure. It involves breaking down sentences into their constituent parts.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

part-of-speech tagging

Assigning grammatical categories (like noun, verb, adjective) to each word in a sentence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)

part-of-speech tags

Grammatical categories assigned to words, such as noun, verb, adjective, etc.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

pattern-matching

Pattern matching is a technique where a computer program searches for specific patterns or sequences in text and responds based on those patterns. It's a fundamental technique in early natural language processing.

Introduction → Rule-Based NLP: From Turing to Templates → 1966: ELIZA

perceptron

The perceptron is the first artificial neural network that could learn to classify patterns by adjusting its connection weights based on training examples.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

perceptron learning rule

The perceptron learning rule adjusts weights when the perceptron makes an incorrect prediction, moving the decision boundary toward the correct classification.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron
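
A minimal sketch of the perceptron update on a tiny linearly separable dataset (logical AND); the learning rate and epoch count are illustrative.

```python
import numpy as np

# Logical AND: only the input (1, 1) belongs to the positive class
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = 1 if xi @ w + b > 0 else 0
        error = target - prediction
        # Weights move only when the prediction is wrong
        w += lr * error * xi
        b += lr * error

print(w, b)   # ends at a separating boundary, e.g. w = [0.2, 0.1], b = -0.2
```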

phonemes

The smallest units of sound in speech that can distinguish one word from another. For example, the sounds /p/, /t/, /k/ are different phonemes in English.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

Phrase-based translation

Phrase-based translation aligns phrases rather than individual words, capturing more complex linguistic patterns and improving translation quality.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

pointwise operations

Element-wise operations that combine information: $\times$ for element-wise multiplication (filtering), $+$ for addition (combining information), and $\tanh$ for applying the hyperbolic tangent activation function.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)
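
Putting the gates and pointwise operations together, the standard LSTM update can be written (in the usual formulation, with $\sigma$ the sigmoid function) as $f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$, $i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)$, $C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$, and $o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$ with $h_t = o_t \times \tanh(C_t)$.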

Pooling layers

Layers that reduce the spatial dimensions of the data while preserving its important features.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)

precision

The fraction of generated n-grams that also appear in the references.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2002: BLEU Metric

preposition

Words that show relationships between other words, often indicating location, time, or direction.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

probabilistic framework

A mathematical framework based on probability theory for modeling uncertainty and making predictions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Probabilistic frameworks

Mathematical frameworks based on probability theory for modeling uncertainty and making predictions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Probabilistic modeling

Probabilistic modeling uses probability theory to represent uncertainty and make predictions based on statistical patterns in data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

probabilistic models

Models that use probability theory to represent uncertainty and make predictions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Probabilistic outputs

Outputs that include probability scores indicating the model's confidence in its predictions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

probability mass

Probability mass refers to the total probability distributed across all possible outcomes, which must sum to 1. Reserving mass for unseen events ensures the model can handle new data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

pronoun reference

Pronoun resolution is a specific type of reference resolution that determines what a pronoun (like 'it', 'he', 'she') refers to in context.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

R

receptive field

The area of the input that affects a particular output.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)

recurrent neural networks

Recurrent Neural Networks (RNNs) are neural networks designed to process sequential data by maintaining internal memory through recurrent connections, allowing them to capture temporal dependencies.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)

Recurrent Neural Networks (RNNs)

Neural networks with connections that form cycles, allowing them to maintain memory of previous inputs by processing sequences step-by-step.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)

redistributing probability mass

Redistributing probability mass means taking some probability from seen events and allocating it to unseen events, ensuring that all possible outcomes have non-zero probability.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

reference problem

Reference resolution is the process of determining what a pronoun or other referring expression refers to in context. For example, understanding what 'it' refers to in a sentence.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

regularization techniques

Regularization techniques like dropout, weight decay, and early stopping help prevent overfitting by constraining the model's complexity or training process.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1986: Backpropagation

response templates

Response templates are pre-written sentence structures with placeholders that can be filled with words from the user's input. For example, 'Tell me more about your [family member]' where [family member] gets replaced with words like 'mother' or 'father'.

Introduction → Rule-Based NLP: From Turing to Templates → 1966: ELIZA

RNNs

Networks with feedback connections that process sequences by maintaining a hidden state over time.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)

Robustness

Robustness refers to how well a system handles unexpected inputs or edge cases. Statistical MT handled unknown words and phrases better than rule-based systems.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

Rogerian psychotherapist

Rogerian therapy, developed by Carl Rogers, is a non-directive form of psychotherapy that emphasizes reflecting the patient's statements back to them rather than offering specific advice or interpretations.

Introduction → Rule-Based NLP: From Turing to Templates → 1966: ELIZA

S

Scalability

Scalability refers to how well a system can handle increasing amounts of data or complexity. Rule-based systems often become unwieldy as the number of rules grows.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

scaling hypothesis

The scaling hypothesis suggests that performance of language models improves predictably with more data, parameters, and compute. As models scale up, they often reveal unexpected emergent capabilities that weren't present in smaller models.

Introduction → A Quick Glance Through the History of Language AI

Sequence modeling

Sequence modeling involves processing data that has a temporal or ordered structure, where the order of elements matters.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

simplifying independence assumptions

Assumptions that variables are independent of each other, which can simplify models but may miss important relationships.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

Sobel operators

A technique for detecting edges in images by computing gradients.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)

Sparse alignments

Sparse alignments occur when many word pairs have insufficient training data, leading to poor translation quality for rare or unseen word combinations.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

sparsity problem

Sparsity in language modeling occurs when many possible word combinations have zero or very few occurrences in the training data, making probability estimates unreliable.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

Speech recognition

Speech recognition systems convert spoken words to text, requiring robust language models that can handle variations in pronunciation and unexpected word combinations.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

State features

Features that measure how well a particular label fits the current input word or token.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

statistical and data-driven approaches

Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems

statistical approach

Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

statistical approaches

Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models

statistical language modeling

Statistical language modeling involves using probability theory to predict the likelihood of word sequences, enabling computers to understand and generate human language.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off

statistical problem

A statistical problem involves using probability theory and data analysis to find patterns and make predictions, rather than using deterministic rules.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation

structured prediction

Predicting outputs that have internal structure, like sequences, trees, or graphs, rather than simple classifications.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields

support vector machines

Support vector machines find optimal linear separators by maximizing the margin between different classes, directly building on perceptron concepts.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1957: The Perceptron

symbol grounding problem

The symbol grounding problem asks how linguistic symbols (words) connect to real-world meaning. It's the challenge of how abstract symbols can have concrete meaning in the world.

Introduction → Rule-Based NLP: From Turing to Templates → 1968: SHRDLU

Symbolic systems

Symbolic systems use explicit rules and logical structures to process language, as opposed to statistical or neural approaches that learn patterns from data.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems
View in context

syntactic parse trees

Tree structures representing the grammatical relationships between words in a sentence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context

syntactic units

Groups of words that function as grammatical units within a sentence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context

T

tanh operations

The hyperbolic tangent activation function that outputs values between -1 and 1, used to create new candidate information and to squash the cell state output.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1997: Long Short-Term Memory (LSTM)
View in context
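
The function itself is tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); a quick NumPy check shows the (−1, 1) output range that keeps LSTM candidate values bounded:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))   # ~[-0.9999, -0.7616, 0.0, 0.7616, 0.9999]

# Same result from the definition (e^x - e^-x) / (e^x + e^-x)
print((np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x)))
```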

Text prediction

Text prediction systems suggest the next word or phrase as users type, requiring fast and reliable probability estimates.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off
View in context

Time Delay Neural Networks (TDNN)

Time Delay Neural Networks (TDNN) are a type of neural network architecture designed to process sequential data by using shared weights across different time steps, making them particularly effective for speech recognition and other temporal pattern recognition tasks.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)
View in context
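
A minimal sketch (NumPy, toy numbers) of the key idea: the same small weight vector is applied to every window of time steps, which amounts to a one-dimensional convolution over the sequence:

```python
import numpy as np

signal = np.array([0.1, 0.5, 0.9, 0.4, -0.2, -0.7, 0.0, 0.3])  # toy input over 8 time steps
weights = np.array([0.25, 0.5, 0.25])                           # one shared filter (delay window of 3)
bias = 0.1

# Apply the same weights at every time offset: weight sharing across time.
outputs = [np.tanh(signal[t:t + 3] @ weights + bias) for t in range(len(signal) - 2)]
print(np.round(outputs, 3))
```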

time delay units

Time delay units are components that store input values from previous time steps, allowing the network to access historical information when making predictions.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)
View in context

Training complexity

The computational cost and time required to train the model on large datasets.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context

Transformational Grammar

Transformational Grammar is a theory of grammar that posits that sentences have both a deep structure (underlying meaning) and a surface structure (actual form), with rules that transform one into the other.

Introduction → Rule-Based NLP: From Turing to Templates → Early Grammars and Symbolic Systems
View in context

Transformer architecture

The Transformer is a neural network architecture that uses self-attention mechanisms to process sequences. It was introduced in 2017 and has become the foundation for most modern language models like BERT and GPT.

Introduction → A Quick Glance Through the History of Language AI
View in context
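
A minimal sketch of the self-attention step at the heart of the architecture (random toy matrices; a real Transformer adds learned projections, multiple heads, and feed-forward layers):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional representations
X = rng.normal(size=(seq_len, d_model))

# In a real model Q, K, V come from learned linear projections of X; here we reuse X directly.
Q, K, V = X, X, X

scores = Q @ K.T / np.sqrt(d_model)          # scaled dot-product similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over keys
attended = weights @ V                       # each token becomes a weighted mix of all tokens
print(attended.shape)                        # (4, 8)
```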

transformer architectures

Transformer architectures use multiple layers of attention mechanisms to process sequences of data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)
View in context

transformers

Neural network architecture that uses attention mechanisms to process sequences in parallel rather than sequentially.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: Recurrent Neural Networks (RNNs)
View in context

Transition features

Features that measure how well adjacent labels work together in sequence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context
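
A rough sketch (hand-written toy features; real CRF toolkits learn a weight for each such feature) of how a transition feature differs from a state feature:

```python
def state_feature(word, label):
    # State feature: does this label fit this particular word?
    return 1.0 if word[0].isupper() and label == "PROPER_NOUN" else 0.0

def transition_feature(prev_label, label):
    # Transition feature: do these adjacent labels go together?
    return 1.0 if prev_label == "DETERMINER" and label == "NOUN" else 0.0

# A CRF scores a whole label sequence by summing weighted features over all positions.
print(state_feature("Paris", "PROPER_NOUN"))      # 1.0
print(transition_feature("DETERMINER", "NOUN"))   # 1.0
```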

Transition Probabilities

Transition probabilities describe how likely it is to move from one hidden state to another. They capture the dynamics of the underlying system.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1970s: Hidden Markov Models
View in context
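
As a toy illustration (made-up numbers), transition probabilities can be written as a matrix with one row per current hidden state, each row summing to 1:

```python
import numpy as np

states = ["NOUN", "VERB"]
# A[i, j] = P(next state = states[j] | current state = states[i])
A = np.array([[0.3, 0.7],
              [0.8, 0.2]])

print(A.sum(axis=1))                                    # [1. 1.] -- valid rows
print(A[states.index("NOUN"), states.index("VERB")])    # P(VERB | NOUN) = 0.7
```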

trigram model

A trigram model uses sequences of three words to predict the next word. It captures more context than bigrams but requires more training data.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Katz Back-off
View in context
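
A minimal sketch of the maximum-likelihood trigram estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), on a tiny made-up corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat sat on the rug".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

print(trigram_prob("the", "cat", "sat"))   # 1.0 in this toy corpus
print(trigram_prob("the", "cat", "ran"))   # 0.0 -- unseen, which is where back-off helps
```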

V

valid probability distributions

Probability distributions where all probabilities are non-negative and sum to 1.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context
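
A quick numerical check (toy scores) of what the definition requires, using exponentiation and normalization to turn arbitrary scores into a valid distribution, much as a CRF's partition function does globally:

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5])            # arbitrary real-valued scores
probs = np.exp(scores) / np.exp(scores).sum()  # non-negative and normalized

print(probs)         # every entry >= 0
print(probs.sum())   # 1.0
```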

vanishing gradient problem

A problem in deep networks where gradients become very small as they are propagated back through many layers, making learning in the earlier layers slow or impossible.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)
View in context
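
A rough numeric illustration (made-up derivative value): when backpropagation multiplies many per-layer derivatives smaller than 1, the product collapses toward zero, so early layers receive almost no learning signal:

```python
grad = 1.0
layer_derivative = 0.25   # e.g. the sigmoid derivative never exceeds 0.25
for layer in range(1, 21):
    grad *= layer_derivative
    if layer in (1, 5, 10, 20):
        print(f"after {layer:2d} layers: {grad:.2e}")
# after 20 layers: ~9.1e-13 -- effectively zero
```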

vehicle is a hypernym of car

A more general term that encompasses more specific instances.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0
View in context

verb

Words that express actions, states, or occurrences.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context

Viterbi

A dynamic programming algorithm for finding the most likely sequence of hidden states in a probabilistic model.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 2001: Conditional Random Fields
View in context
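
A compact sketch of the dynamic program (a toy HMM with made-up probabilities): keep, for each state and time step, the probability of the best path ending there, then trace back:

```python
import numpy as np

# Toy HMM: 2 hidden states, 3 observation symbols (all numbers hypothetical).
start = np.array([0.6, 0.4])                 # P(first state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],            # P(observation | state)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                              # observed symbol indices

n_states, T = len(start), len(obs)
delta = np.zeros((T, n_states))              # best path probability ending in each state
back = np.zeros((T, n_states), dtype=int)    # argmax pointers for traceback

delta[0] = start * emit[:, obs[0]]
for t in range(1, T):
    for s in range(n_states):
        cand = delta[t - 1] * trans[:, s]
        back[t, s] = cand.argmax()
        delta[t, s] = cand.max() * emit[s, obs[t]]

# Trace back the most likely hidden state sequence.
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print(path[::-1])
```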

voice activity detection

Voice activity detection determines when speech is present in an audio signal, distinguishing it from background noise or silence.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1962: Neural Networks (MADALINE)
View in context

W

weight sharing

Weight sharing is a technique where the same set of weights is used across different positions or time steps in the input, reducing the number of parameters and allowing the network to learn position-invariant features.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1987: Time Delay Neural Networks (TDNN)
View in context

wheel is a meronym of car

A part or component of something larger.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0
View in context

whisper is a troponym of speak

A specific way of performing a more general action.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1995: WordNet 1.0
View in context

word alignment

Word alignment is the process of identifying which words in a source sentence correspond to which words in a target sentence. This is crucial for building translation models.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation
View in context

word embeddings

Vector representations of words that capture semantic meaning.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1988: Convolutional Neural Networks (CNN)
View in context
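
A tiny illustration (hand-made vectors, not learned ones) of the idea that geometric closeness stands in for semantic similarity:

```python
import numpy as np

embeddings = {                        # toy 3-dimensional "embeddings"
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))   # high: related words
print(cosine(embeddings["cat"], embeddings["car"]))   # low: unrelated words
```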

Word-level modeling

Word-level modeling focuses on individual word correspondences, missing the broader phrase-level and sentence-level structure that is important for accurate translation.

Introduction → Statistical & Probabilistic Methods: Corpora, Probabilities & Katz Back-off → 1991: IBM Statistical Machine Translation
View in context

Word2Vec

Word2Vec is a technique that learns word embeddings by predicting surrounding words in a text corpus. It represents words as dense vectors in a continuous vector space where semantically similar words are close to each other.

Introduction → A Quick Glance Through the History of Language AI
View in context
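
A minimal sketch of training such embeddings with the gensim library (assuming gensim 4.x is installed; the toy corpus is far too small to yield meaningful vectors):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "dog", "chased", "a", "cat"],
]

# sg=1 selects the skip-gram variant (predict surrounding words from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)              # (50,) dense vector for "cat"
print(model.wv.most_similar("cat", topn=2))
```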