A
activation function
An activation function introduces non-linearity into neural networks, allowing them to learn complex patterns. Common examples include sigmoid, tanh, and ReLU functions.
activation functions
An activation function introduces non-linearity into neural networks, allowing them to learn complex patterns. Common examples include sigmoid, tanh, and ReLU functions.
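A minimal sketch of the three functions named above; the helper names are illustrative and not tied to any particular framework:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through unchanged, zeroes out negatives.
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```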
ADALINE units
An adaptive linear element (ADALINE) is a single neuron that can learn to classify patterns by adjusting its weights using the LMS algorithm.
Alignment techniques
Alignment techniques are methods for finding correspondences between different representations, such as words in different languages.
Ambiguity
Ambiguity in language occurs when a sentence or phrase can be interpreted in multiple ways. This is a major challenge for rule-based systems that need to choose the correct interpretation.
artificial intelligence
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It encompasses various technologies including machine learning, natural language processing, and robotics.
attention mechanisms
Attention mechanisms allow neural networks to focus on different parts of the input sequence when making predictions. They compute a weighted sum of input features, where the weights are learned and indicate the importance of each input element.
B
back-off method
Back-off is a technique that uses shorter n-grams when longer ones have insufficient data, allowing models to handle unseen word combinations gracefully.
backpropagation
Backpropagation is an algorithm for efficiently training multi-layer neural networks by computing gradients of the error with respect to each weight.
bias term
A bias term is a constant value added to the weighted sum of inputs in a neural network, allowing the neuron to shift its activation function and learn more flexible decision boundaries.
Bilingual Evaluation Understudy
Bilingual Evaluation Understudy — an n-gram precision-based automatic MT evaluation metric.
BLEU
BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric for machine translation that measures how similar a machine translation is to reference human translations.
blocks world
A blocks world is a simplified artificial environment containing geometric objects like blocks, pyramids, and boxes that can be manipulated. It's used in AI research to study language understanding in a controlled, manageable domain.
brevity penalty
A multiplicative penalty in BLEU that discourages translations much shorter than the reference translations.
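In the original BLEU formulation, with candidate length $c$ and effective reference length $r$:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$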
brittle
Brittle systems fail completely when encountering unexpected inputs, while robust systems can handle edge cases and continue functioning.
buying entails paying
Entailment: one action necessarily implies another (here, buying something entails paying for it).
C
car ↔ automobile
Synonymy: words that can be used interchangeably in certain contexts.
car is a holonym of wheel
Holonymy: the whole-to-part relation; the whole (a car) contains the named part (a wheel).
car is a hyponym of vehicle
Hyponymy: a more specific term that falls under a more general category.
cell state
A dedicated memory pathway that carries information across time steps with minimal modification.
cell state ($C_t$)
In LSTM cell diagrams, the horizontal pathway running along the top of the cell; it carries information across time steps with only minor, gate-controlled modifications.
chain rule
The chain rule is a fundamental theorem in calculus that allows us to compute the derivative of a composite function by multiplying the derivatives of its component functions.
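For $z = f(y)$ and $y = g(x)$:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = f'(g(x))\, g'(x)$$

Backpropagation applies this rule repeatedly, layer by layer, to compute the gradient of the loss with respect to every weight.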
ChatGPT's conversational abilities
ChatGPT is a conversational AI model that can engage in natural language dialogue. It was trained using Reinforcement Learning from Human Feedback (RLHF) to make it more helpful, honest, and harmless.
chunking
Identifying groups of words that function as a single unit, like noun phrases or verb phrases.
conditional likelihood
The probability of the correct output sequence given the input sequence.
Conditional modeling
Modeling the probability of outputs given inputs, rather than modeling the joint probability of inputs and outputs together.
conditional probability
The probability of an event occurring given that another event has already occurred.
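Formally, for events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$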
Consistency
Consistency refers to how uniform the output is across different inputs. Statistical MT produced more uniform translations across different types of text.
Context features
Features that consider the broader context beyond just the current and adjacent elements.
Context-Free Grammars (CFGs)
Context-Free Grammars are formal grammars where each production rule has a single non-terminal symbol on the left side. They are 'context-free' because the rule can be applied regardless of the surrounding context.
continuous
Values that can take on any value within a range (like sound waves, temperatures, or distances)
Convolutional layers
A mathematical operation that applies a filter (kernel) to input data, sliding it across the data to detect patterns
convolutional networks
Convolutional neural networks use multiple layers of feature detectors to build hierarchical representations of input data.
convolutional neural networks
Convolutional Neural Networks (CNNs) are neural networks that use convolutional layers to process data with grid-like topology, such as images or time series data, by applying the same filter across different positions.
credit assignment problem
The credit assignment problem asks how to determine which parts of a system are responsible for errors or successes, particularly in systems with many interconnected components like neural networks.
CRFs
A discriminative probabilistic model that defines P(labels|inputs) for sequences or graphs.
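For a linear-chain CRF over an input sequence $x$ and label sequence $y$, the conditional probability is typically written as

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right)$$

where the $f_k$ are feature functions, the $\lambda_k$ are learned weights, and $Z(x)$ is the normalization factor (partition function).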
curse of dimensionality
In machine learning, the curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces. In language modeling, as n-gram length increases, the number of possible word combinations grows exponentially, making the data increasingly sparse.
D
Data-driven learning
Data-driven learning refers to approaches that learn patterns from data rather than relying on hand-written rules or expert knowledge.
Decoding
Decoding in HMMs finds the most likely sequence of hidden states that could have generated the observed outputs. This is the core of many applications like speech recognition.
deep neural networks
Deep neural networks contain multiple hidden layers that learn hierarchical representations, with each layer building upon features learned by previous layers.
dependencies between adjacent elements
Relationships between elements that are next to each other in the sequence.
Dependency Grammar
Dependency Grammar focuses on the relationships between words rather than phrase structure. Each word (except the root) depends on exactly one other word, creating a tree of dependencies.
determiner
Words like 'the', 'a', 'an' that introduce and specify nouns.
Determiner, Noun, Verb, Preposition, Determiner, Noun
DT=Determiner, NN=Noun, VB=Verb, IN=Preposition - standard abbreviations for grammatical categories.
discount factor
A discount factor reduces the probability of observed events to reserve probability mass for unseen events, preventing overconfidence in sparse data.
Discrete
Values that can only take on specific, separate values (like words, categories, or whole numbers)
dynamic programming
Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It's used in HMM algorithms like the Viterbi algorithm.
E
edge cases
Edge cases are unusual or extreme situations that test the limits of a system's capabilities, requiring robust handling to prevent failures.
embodied AI
Embodied AI refers to artificial intelligence systems that interact with the physical world through sensors and actuators, rather than being purely computational. It emphasizes the importance of physical interaction for intelligence.
Emission Probabilities
Emission probabilities describe how likely each observation is given a hidden state. They capture the relationship between hidden states and observable outputs.
End-to-end learning
End-to-end learning refers to training systems to perform a complete task directly, without requiring intermediate steps like explicit alignment.
error-driven learning approach
The LMS algorithm established the principle of error-driven learning, where weights are adjusted proportionally to prediction errors.
Evaluation
Evaluation in HMMs determines the probability of observing a sequence of outputs given a model. This is used to score how well different models explain the observed data.
Evaluation metrics
Evaluation metrics are objective measures used to assess the quality of machine learning systems, such as BLEU for translation quality.
expectation-maximization
Expectation-maximization is an iterative algorithm for finding maximum likelihood estimates of parameters in statistical models with hidden variables.
F
factor-graph
A graphical representation showing the relationships between variables in a probabilistic model, where nodes represent variables and edges represent dependencies.
Feature engineering
Feature engineering is the process of manually creating features from raw data that are relevant for machine learning models. HMMs required extensive hand-crafting of features.
feature functions
Functions that fire (often with values of 0 or 1) for particular input/label configurations, e.g., a particular word shape occurring together with a particular tag.
Feature learning
The ability of neural networks to automatically discover useful features from raw data without manual engineering.
feedforward neural networks
Feedforward neural networks are neural networks where connections between nodes do not form cycles, meaning information flows in one direction from input to output without any feedback loops.
forget gate
Controls which parts of the previous cell state are erased.
formal grammar rules
Formal grammar rules are explicit, mathematical descriptions of how sentences are structured. They define the syntax and relationships between words in a language.
Formal grammars
Formal grammars are mathematical systems for describing the structure of languages. They provide precise rules for generating valid sentences in a language.
frame problem
The frame problem is the challenge of determining which aspects of a situation are relevant when reasoning about actions and their consequences. It's a fundamental problem in AI that makes complete symbolic representation intractable in complex domains.
G
global context
The broader context or overall structure that affects individual decisions.
Global features
Features that capture properties of the entire sequence that affect local decisions.
Global optimization
Finding the best solution by considering the entire sequence at once, rather than making decisions one element at a time.
Good-Turing estimator
The Good-Turing estimator is a statistical technique that estimates the probability of unseen events based on the frequency distribution of seen events. It was originally developed during WWII to help crack the Enigma code.
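If $N_r$ denotes the number of distinct events observed exactly $r$ times, the Good-Turing adjusted count for an event seen $r$ times is

$$r^{*} = (r + 1)\,\frac{N_{r+1}}{N_r}$$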
GPT-3's few-shot learning
Few-shot learning is the ability of a model to learn new tasks from just a few examples, without extensive retraining. GPT-3 demonstrated this capability by performing various tasks after seeing just a few examples.
GPT-4's multimodal capabilities
Multimodal AI can process and understand multiple types of data simultaneously, such as text, images, audio, and video. GPT-4 can analyze images and text together to answer questions about visual content.
gradient descent
Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function to find the minimum.
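With parameters $\theta$, loss $L(\theta)$, and learning rate $\eta$, each step is

$$\theta \leftarrow \theta - \eta\, \nabla_{\theta} L(\theta)$$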
gradient-based learning
Gradient-based learning adjusts model parameters by following the gradient of the loss function, moving toward optimal solutions through iterative updates.
gradients
Gradients are vectors of partial derivatives that indicate how much a function changes with respect to each of its input variables. In neural networks, they show how the loss changes with respect to each weight.
grammatical reflection
Grammatical reflection is a technique where a statement is transformed into a question by changing its grammatical structure. For example, 'I am sad' becomes 'Why do you think you are sad?'
greedy local decisions
Making decisions one at a time based only on local information, without considering the full context.
GRUs
Gated Recurrent Units, a simplified version of LSTM with fewer parameters but similar performance.
H
hand-crafted features
Features that are manually designed and crafted by domain experts.
Handling dependencies
Explicitly modeling the relationships between different elements in the sequence.
Harris corners
A method for detecting corners and interest points in images
Hidden Markov Models
A statistical model that describes a system with hidden states that influence observable outputs. The system follows the Markov property, meaning the current state depends only on the previous state.
Hidden Markov Models (HMMs)
A statistical model that describes a system with hidden states that influence observable outputs. The system follows the Markov property, meaning the current state depends only on the previous state.
hidden state
An internal representation that carries information from previous time steps, acting as the network's memory.
Hidden state inference
Hidden state inference is the process of determining the most likely hidden states given observed outputs, which is a key capability of HMMs.
hidden states
Hidden states are the underlying system states that cannot be directly observed but influence the observable outputs. In HMMs, these states follow a Markov process.
hierarchical approach
A hierarchical approach processes data at multiple levels, starting with the most specific (longest n-grams) and falling back to more general (shorter n-grams) when needed.
hot ↔ cold
Antonymy: words with contrasting or opposite meanings.
Hybrid systems
Hybrid systems combine multiple approaches, such as symbolic rules with statistical learning, to leverage the strengths of each method.
I
IBM Model 1
IBM Model 1 was the simplest IBM translation model that aligned words one-to-one between languages. While limited, it established the basic statistical framework for machine translation.
IBM Model 2
IBM Model 2 introduced alignment probabilities to handle the fact that word order differs between languages, making it more realistic than Model 1.
IBM Model 3
IBM Model 3 added fertility modeling to handle cases where one source word translates to multiple target words, addressing a key limitation of earlier models.
IBM Model 4
IBM Model 4 introduced distortion modeling to capture how word positions change during translation, making it more accurate for languages with different word orders.
IBM Model 5
IBM Model 5 refined the distortion modeling and made the training process more stable, representing the most sophisticated of the IBM models.
independent predictions
Making predictions for each element independently, without considering relationships between elements.
information
In information theory, 'information' refers to the amount of uncertainty or surprise in a message. The more unpredictable something is, the more information it carries. This is measured in bits.
information extraction
Extracting structured information (like names, dates, relationships) from unstructured text.
information theory
Information theory is a branch of mathematics that studies the quantification, storage, and communication of information. It was developed by Claude Shannon and provides the mathematical foundation for data compression and error correction.
input gate
Controls which new information enters the cell state.
interdependent
When the prediction of one element affects or is affected by the predictions of other elements in the sequence.
interpolation
Interpolation refers to combining multiple probability estimates, often using a weighted average to create a more robust final estimate.
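A common form for language models linearly interpolates trigram, bigram, and unigram estimates with non-negative weights that sum to 1:

$$\hat{P}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_3 P(w_t \mid w_{t-2}, w_{t-1}) + \lambda_2 P(w_t \mid w_{t-1}) + \lambda_1 P(w_t), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1$$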
J
joint probability
The probability of two or more events occurring together.
K
Katz back-off
Katz back-off is a smoothing technique that handles unseen word sequences by 'backing off' to shorter n-grams when longer ones aren't available in the training data.
kernel size
Kernel size refers to the number of time steps or spatial positions that a filter (kernel) covers in a single convolution operation.
L
Language modeling
Language modeling involves predicting the probability of word sequences, forming the foundation for many NLP applications.
language models
AI models designed to understand, generate, and work with human language.
learned weights
Parameters that determine the importance of each feature function in the model.
Learning
Learning in HMMs involves estimating the model parameters (transition and emission probabilities) from training data using algorithms like the Baum-Welch algorithm.
learning rate
The learning rate controls how large steps the perceptron takes when updating its weights. Too large and learning becomes unstable; too small and learning becomes very slow.
Limited context
Limited context refers to the constraint that HMMs can only look back one step due to the Markov assumption, missing longer-range dependencies in language.
Limited expressiveness
The model's ability to capture complex patterns and relationships in the data.
linear classifier
A linear classifier separates data into categories using a straight line (in 2D) or hyperplane (in higher dimensions).
Linear modeling
Models that assume linear relationships between input features and output probabilities.
linearly separable problems
Linearly separable data can be perfectly separated into classes using a straight line (2D) or hyperplane (higher dimensions). If data points cannot be separated this way, they are linearly inseparable.
LMS algorithm
The LMS algorithm became fundamental in adaptive signal processing, providing a stable method for minimizing squared errors through iterative weight adjustments.
LMS weight update
The LMS (Least Mean Squares) algorithm minimizes the squared error between desired and actual outputs by adjusting weights proportionally to the error.
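The Widrow–Hoff (LMS) update for weight vector $\mathbf{w}$, input $\mathbf{x}$, desired output $d$, actual output $y = \mathbf{w}^{\top}\mathbf{x}$, and learning rate $\eta$:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,(d - y)\,\mathbf{x}$$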
Local decisions
Local decisions refer to making translation choices independently for each word or phrase, missing the global coherence of the entire sentence.
local minima
Local minima are points where a function has its lowest value in a neighborhood, but not necessarily the globally lowest value.
Local optima
Local optima are solutions that are optimal within a small region but not globally optimal. Training algorithms can get stuck in these suboptimal solutions.
loss landscape
Loss landscape refers to the multi-dimensional surface formed by plotting the loss function against all possible weight combinations.
LSTMs
Long Short-Term Memory networks, a type of RNN that solved the vanishing gradient problem through gated mechanisms.
M
Machine translation
Machine translation automatically translates text from one language to another, requiring models that can handle unseen patterns in the source language.
MADALINE
MADALINE (Multiple ADAptive LINear Elements) is an early neural network architecture that uses multiple adaptive linear elements connected in a specific pattern to solve classification problems.
Maintainability
Maintainability refers to how easy it is to update and improve a system. Statistical MT required less linguistic expertise to develop and maintain.
Markov
A property of a system where the current state depends only on the previous state, not the entire history. This simplifying assumption makes mathematical modeling tractable.
Markov assumption
The Markov assumption states that the current state depends only on the previous state, not the entire history. In n-gram models, this means predicting the next word based only on the last few words, not the entire sentence.
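For an n-gram model this means

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$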
maximum entropy models
A type of probabilistic model that maximizes entropy subject to constraints, often used for classification.
microworlds
Microworlds are simplified, artificial environments used in AI research to study complex problems in isolation. They allow researchers to focus on specific aspects of intelligence without dealing with the full complexity of the real world.
N
n-gram models
An n-gram model predicts each word from the preceding n−1 words using counts of n-grams, which are contiguous sequences of n words. For example, in 'the cat sat', the bigrams (2-grams) are 'the cat' and 'cat sat', and the only trigram (3-gram) is 'the cat sat'.
n-grams
N-grams are contiguous sequences of n words from a text. For example, in 'the cat sat', the bigrams (2-grams) are 'the cat' and 'cat sat', and the only trigram (3-gram) is 'the cat sat'.
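A minimal sketch of n-gram extraction over the example above; the `ngrams` helper is illustrative, not a library function:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat".split()
print(ngrams(tokens, 2))  # [('the', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # [('the', 'cat', 'sat')]
```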
named entities
Identifying and classifying named entities like people, organizations, locations, dates, etc. in text.
named entity recognition
Identifying and classifying named entities like people, organizations, locations, dates, etc. in text.
natural language processing
Computational techniques for understanding, interpreting, and generating human language.
Neural CRFs
Models that combine the structured prediction capabilities of CRFs with the feature learning abilities of neural networks.
Neural machine translation
Neural machine translation uses neural networks to learn continuous representations of words and sentences, enabling more sophisticated translation models.
neural methods
Machine learning methods based on neural networks that became dominant in the 2010s and beyond.
neural network
Neural networks are computational models inspired by biological neural networks. They consist of interconnected nodes (neurons) that process information and can learn patterns from data.
Noam Chomsky
Noam Chomsky (1928-) is an American linguist, philosopher, and cognitive scientist who revolutionized the study of language with his theory of generative grammar and the concept of universal grammar.
noisy channel model
The noisy channel model assumes that the source sentence is a 'corrupted' or 'noisy' version of some original target sentence, and translation is the process of recovering the original clean signal.
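Applied to translation, the model recovers the most probable target sentence $\hat{e}$ for an observed source sentence $f$ by combining a language model $P(e)$ with a translation model $P(f \mid e)$:

$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)$$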
normalization factor
The partition function ensuring probabilities sum to 1 across all possible output sequences.
noun
Words that name people, places, things, or ideas.
O
optical character recognition (OCR)
Optical character recognition converts images of text into machine-readable text, enabling computers to process scanned documents.
optimal tag sequence
The sequence of labels that maximizes the overall probability given the input.
output gate
Controls how much of the cell state becomes visible in the hidden state at the current time step.
overfitting
When a model learns the training data too well but fails to generalize to new data
P
pairwise dependencies
Dependencies between pairs of adjacent elements, rather than more complex multi-element relationships.
parallel text data
Parallel text data consists of pairs of sentences in different languages that express the same meaning. This data is essential for training statistical translation models.
parser
A parser is a program that analyzes the grammatical structure of sentences and converts them into a form that can be understood and processed by a computer.
Parsing
Parsing is the process of analyzing a string of symbols according to a formal grammar to determine its grammatical structure. It involves breaking down sentences into their constituent parts.
part-of-speech tagging
Assigning grammatical categories (like noun, verb, adjective) to each word in a sentence.
part-of-speech tags
Grammatical categories assigned to words, such as noun, verb, adjective, etc.
pattern-matching
Pattern matching is a technique where a computer program searches for specific patterns or sequences in text and responds based on those patterns. It's a fundamental technique in early natural language processing.
perceptron
The perceptron is the first artificial neural network that could learn to classify patterns by adjusting its connection weights based on training examples.
perceptron learning rule
The perceptron learning rule adjusts weights when the perceptron makes an incorrect prediction, moving the decision boundary toward the correct classification.
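A minimal sketch of the update for labels in $\{+1, -1\}$; the bias is assumed to be folded into the input vector, and all names here are illustrative:

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    # Predict with the current weights (sign of the dot product).
    y_hat = 1 if np.dot(w, x) >= 0 else -1
    # Adjust the weights only when the prediction is wrong,
    # moving the decision boundary toward the correct class.
    if y_hat != y:
        w = w + lr * y * x
    return w

w = np.zeros(3)
w = perceptron_update(w, np.array([1.0, 2.0, -1.0]), y=-1)  # misclassified, so w changes
```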
phonemes
The smallest units of sound in speech that can distinguish one word from another. For example, the sounds /p/, /t/, /k/ are different phonemes in English.
Phrase-based translation
Phrase-based translation aligns phrases rather than individual words, capturing more complex linguistic patterns and improving translation quality.
pointwise operations
Element-wise operations that combine information within the LSTM cell: $\times$ for element-wise multiplication (filtering), $+$ for addition (combining information), and $\tanh$ for applying the hyperbolic tangent activation function.
Pooling layers
A technique that reduces the spatial dimensions of data while preserving important features
precision
The fraction of generated n-grams that also appear in the references.
preposition
Words that show relationships between other words, often indicating location, time, or direction.
probabilistic framework
A mathematical framework based on probability theory for modeling uncertainty and making predictions.
Probabilistic frameworks
Mathematical frameworks based on probability theory for modeling uncertainty and making predictions.
Probabilistic modeling
Probabilistic modeling uses probability theory to represent uncertainty and make predictions based on statistical patterns in data.
probabilistic models
Models that use probability theory to represent uncertainty and make predictions.
Probabilistic outputs
Outputs that include probability scores indicating the model's confidence in its predictions.
probability mass
Probability mass refers to the total probability distributed across all possible outcomes, which must sum to 1. Reserving mass for unseen events ensures the model can handle new data.
pronoun reference
Pronoun resolution is a specific type of reference resolution that determines what a pronoun (like 'it', 'he', 'she') refers to in context.
R
receptive field
The area of the input that affects a particular output
recurrent neural networks
Recurrent Neural Networks (RNNs) are neural networks designed to process sequential data by maintaining internal memory through recurrent connections, allowing them to capture temporal dependencies.
Recurrent Neural Networks (RNNs)
Neural networks with connections that form cycles, allowing them to maintain memory of previous inputs by processing sequences step-by-step.
redistributing probability mass
Redistributing probability mass means taking some probability from seen events and allocating it to unseen events, ensuring that all possible outcomes have non-zero probability.
reference problem
Reference resolution is the process of determining what a pronoun or other referring expression refers to in context. For example, understanding what 'it' refers to in a sentence.
regularization techniques
Regularization techniques like dropout, weight decay, and early stopping help prevent overfitting by constraining the model's complexity or training process.
response templates
Response templates are pre-written sentence structures with placeholders that can be filled with words from the user's input. For example, 'Tell me more about your [family member]' where [family member] gets replaced with words like 'mother' or 'father'.
RNNs
Networks with feedback connections that process sequences by maintaining a hidden state over time.
Robustness
Robustness refers to how well a system handles unexpected inputs or edge cases. Statistical MT handled unknown words and phrases better than rule-based systems.
Rogerian psychotherapist
Rogerian therapy, developed by Carl Rogers, is a non-directive form of psychotherapy that emphasizes reflecting the patient's statements back to them rather than offering specific advice or interpretations.
S
Scalability
Scalability refers to how well a system can handle increasing amounts of data or complexity. Rule-based systems often become unwieldy as the number of rules grows.
scaling hypothesis
The scaling hypothesis suggests that performance of language models improves predictably with more data, parameters, and compute. As models scale up, they often reveal unexpected emergent capabilities that weren't present in smaller models.
Sequence modeling
Sequence modeling involves processing data that has a temporal or ordered structure, where the order of elements matters.
simplifying independence assumptions
Assumptions that variables are independent of each other, which can simplify models but may miss important relationships.
Sobel operators
A technique for detecting edges in images by computing gradients
Sparse alignments
Sparse alignments occur when many word pairs have insufficient training data, leading to poor translation quality for rare or unseen word combinations.
sparsity problem
Sparsity in language modeling occurs when many possible word combinations have zero or very few occurrences in the training data, making probability estimates unreliable.
Speech recognition
Speech recognition systems convert spoken words to text, requiring robust language models that can handle variations in pronunciation and unexpected word combinations.
State features
Features that measure how well a particular label fits the current input word or token.
statistical and data-driven approaches
Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.
statistical approach
Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.
statistical approaches
Statistical approaches in NLP use probability theory and data analysis to learn patterns from large amounts of text, rather than relying on hand-written rules.
statistical language modeling
Statistical language modeling involves using probability theory to predict the likelihood of word sequences, enabling computers to understand and generate human language.
statistical problem
A statistical problem involves using probability theory and data analysis to find patterns and make predictions, rather than using deterministic rules.
structured prediction
Predicting outputs that have internal structure, like sequences, trees, or graphs, rather than simple classifications.
support vector machines
Support vector machines find optimal linear separators by maximizing the margin between different classes, directly building on perceptron concepts.
symbol grounding problem
The symbol grounding problem asks how linguistic symbols (words) connect to real-world meaning. It's the challenge of how abstract symbols can have concrete meaning in the world.
Symbolic systems
Symbolic systems use explicit rules and logical structures to process language, as opposed to statistical or neural approaches that learn patterns from data.
syntactic parse trees
Tree structures representing the grammatical relationships between words in a sentence.
syntactic units
Groups of words that function as grammatical units within a sentence.
T
tanh operations
The hyperbolic tangent activation function that outputs values between -1 and 1, used to create new candidate information and to squash the cell state output.
Text prediction
Text prediction systems suggest the next word or phrase as users type, requiring fast and reliable probability estimates.
Time Delay Neural Networks (TDNN)
Time Delay Neural Networks (TDNN) are a type of neural network architecture designed to process sequential data by using shared weights across different time steps, making them particularly effective for speech recognition and other temporal pattern recognition tasks.
time delay units
Time delay units are components that store input values from previous time steps, allowing the network to access historical information when making predictions.
Training complexity
The computational cost and time required to train the model on large datasets.
Transformational Grammar
Transformational Grammar is a theory of grammar that posits that sentences have both a deep structure (underlying meaning) and a surface structure (actual form), with rules that transform one into the other.
Transformer architecture
The Transformer is a neural network architecture that uses self-attention mechanisms to process sequences. It was introduced in 2017 and has become the foundation for most modern language models like BERT and GPT.
transformer architectures
Transformer architectures use multiple layers of attention mechanisms to process sequences of data.
transformers
Neural network architecture that uses attention mechanisms to process sequences in parallel rather than sequentially.
Transition features
Features that measure how well adjacent labels work together in sequence.
Transition Probabilities
Transition probabilities describe how likely it is to move from one hidden state to another. They capture the dynamics of the underlying system.
trigram model
A trigram model uses sequences of three words to predict the next word. It captures more context than bigrams but requires more training data.
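The maximum-likelihood estimate from corpus counts:

$$P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1\, w_2\, w_3)}{\text{count}(w_1\, w_2)}$$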
V
valid probability distributions
Probability distributions where all probabilities are non-negative and sum to 1.
vanishing gradient problem
A problem in deep networks where gradients become very small, making learning difficult
vehicle is a hypernym of car
Hypernymy: a more general term that encompasses more specific instances.
verb
Words that express actions, states, or occurrences.
Viterbi
A dynamic programming algorithm for finding the most likely sequence of hidden states in a probabilistic model.
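A minimal sketch of Viterbi decoding for a small HMM, with initial, transition, and emission probabilities held in plain dictionaries; the toy states, observations, and probability values are invented for illustration:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best state path ending in state s at step t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the previous state that maximizes the path probability.
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["Rainy", "Sunny"]
obs = ["walk", "shop", "clean"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3}, "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(obs, states, start, trans, emit))
```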
voice activity detection
Voice activity detection determines when speech is present in an audio signal, distinguishing it from background noise or silence.
W
weight sharing
Weight sharing is a technique where the same set of weights is used across different positions or time steps in the input, reducing the number of parameters and allowing the network to learn position-invariant features.
wheel is a meronym of car
Meronymy: a part or component of a larger whole.
whisper is a troponym of speak
Troponymy: a specific manner of performing a more general action.
word alignment
Word alignment is the process of identifying which words in a source sentence correspond to which words in a target sentence. This is crucial for building translation models.
word embeddings
Vector representations of words that capture semantic meaning
Word-level modeling
Word-level modeling focuses on individual word correspondences, missing the broader phrase-level and sentence-level structure that is important for accurate translation.
Word2Vec
Word2Vec is a technique that learns word embeddings by predicting surrounding words in a text corpus. It represents words as dense vectors in a continuous vector space where semantically similar words are close to each other.