TF-IDF and Bag of Words: Classical text representation methods and their applications
Before neural networks and transformers, how did we convert text into numbers that machines could process? The answer lies in two foundational techniques that remain relevant today: Bag of Words and TF-IDF (Term Frequency-Inverse Document Frequency).
These methods solve a fundamental problem in NLP: text is inherently discrete and symbolic, but most machine learning algorithms require numerical input. Bag of Words provides a simple way to count word occurrences, while TF-IDF adds sophistication by weighting words based on their importance across a document collection.
In this chapter, we'll explore how these classical techniques work, implement them from scratch, and understand both their power and limitations. You'll see how they enable everything from search engines to document classification, and why they're still used in production systems today.
Introduction
Imagine you have a collection of documents, maybe product reviews, news articles, or research papers. You want to find which documents are similar, classify them by topic, or build a search system. The challenge: computers can't directly understand words.
Bag of Words solves this by treating each document as an unordered collection of word counts. It's called a "bag" because word order doesn't matter. Only how many times each word appears matters. This simple idea enables us to represent any document as a fixed-length vector of numbers.
TF-IDF builds on this foundation. It recognizes that not all words are equally informative. The word "the" appears in almost every document, so it's not useful for distinguishing between documents. But a rare word like "quantum" might be highly informative when it appears. TF-IDF automatically downweights common words and emphasizes distinctive ones.
Together, these techniques form the backbone of classical information retrieval and text classification systems. They're fast, interpretable, and surprisingly effective for many tasks.
Bag of Words: A text representation method that converts documents into fixed-length vectors by counting word occurrences. Each dimension in the vector corresponds to a word in the vocabulary, and the value represents how many times that word appears in the document.
TF-IDF: Term Frequency-Inverse Document Frequency. A weighting scheme that multiplies term frequency (how often a word appears in a document) by inverse document frequency (how rare the word is across the collection). This emphasizes words that are frequent in a specific document but rare overall.
Technical Deep Dive
Bag of Words: The Foundation
Let's start with the simplest approach. Given a vocabulary $V$ with $|V|$ unique words, we can represent any document $d$ as a vector of length $|V|$, where each element $i$ counts how many times word $w_i$ appears in document $d$.
The process involves three steps:
- Tokenization: Split documents into individual words (tokens)
- Vocabulary building: Collect all unique words across all documents
- Vectorization: For each document, count occurrences of each vocabulary word
For example, if our vocabulary is $\{\text{cat}, \text{dog}, \text{runs}\}$ and a document contains "cat runs", the vector would be $[1, 0, 1]$: one occurrence of "cat", zero of "dog", and one of "runs".
This representation has several properties:
- Fixed dimensionality: All documents map to vectors of the same length
- Sparsity: Most documents use only a small fraction of the vocabulary, so most vector elements are zero
- Order independence: "cat runs" and "runs cat" produce identical vectors
The sparsity is important. In practice, vocabularies can contain tens of thousands of words, but individual documents might use only hundreds. This makes sparse matrix representations efficient for storage and computation.
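As a rough illustration of the storage savings (the matrix size here is hypothetical, and SciPy's `csr_matrix` is just one common sparse format):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A dense bag-of-words matrix: 100 documents over a 50,000-word vocabulary
dense = np.zeros((100, 50_000))
dense[0, 42] = 3.0  # in practice, only a few hundred entries per row are non-zero

sparse = csr_matrix(dense)
print(dense.nbytes)        # 40,000,000 bytes, almost all of them storing zeros
print(sparse.data.nbytes)  # 8 bytes -- only the non-zero values are stored
```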
Term Frequency (TF)
Term frequency measures how often a word appears in a document. The simplest version is the raw count:

$$\text{tf}(t, d) = f_{t,d}$$

where $f_{t,d}$ is the number of times term $t$ appears in document $d$.
However, longer documents naturally contain more words. To make frequencies comparable across documents of different lengths, we often normalize by document length:

$$\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
This gives us a proportion: what fraction of the document consists of this word? Normalized TF values range from 0 to 1, with 1 meaning the entire document consists of that single word.
Another common normalization uses logarithmic scaling to dampen the effect of very frequent words:

$$\text{tf}(t, d) = \log\left(1 + f_{t,d}\right)$$
This formula ensures that doubling the word count doesn't double the TF score, which helps prevent extremely common words from dominating the representation.
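For example, with the natural logarithm, a count of 2 gives $\log(1 + 2) \approx 1.10$, while doubling it to 4 gives only $\log(1 + 4) \approx 1.61$, well short of double.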
Inverse Document Frequency (IDF)
While term frequency tells us how important a word is within a document, inverse document frequency measures how distinctive it is across the entire collection. The key insight: words that appear in many documents are less informative than words that appear in few.
The inverse document frequency is calculated as:

$$\text{idf}(t) = \log\frac{N}{\text{df}(t)}$$

where $N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing term $t$.
Let's break this down:
- If a word appears in all documents, the denominator equals $N$, so $\text{idf}(t) = \log(N/N) = \log(1) = 0$
- If a word appears in only one document, the denominator is 1, so $\text{idf}(t) = \log(N/1) = \log(N)$
- Words appearing in fewer documents get higher IDF scores
The logarithm serves two purposes: it compresses the scale (so IDF doesn't grow linearly with collection size), and it makes the metric more interpretable. Without the log, a word appearing in 1 out of 1000 documents would have IDF = 1000, while one appearing in 500 documents would have IDF = 2, a 500x difference. The logarithm smooths this out: with the natural log, the scores become roughly 6.9 versus 0.69, only about a 10x gap.
Some implementations add 1 to avoid division by zero and to ensure all terms get at least some weight:

$$\text{idf}(t) = \log\frac{N}{1 + \text{df}(t)} + 1$$
TF-IDF: Combining Both Components
TF-IDF multiplies term frequency by inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
This creates a scoring system where:
- High TF-IDF: Words that are frequent in a specific document but rare across the collection
- Low TF-IDF: Words that are either rare in the document or common across many documents
The multiplication is crucial. A word with high TF but low IDF (common everywhere) gets downweighted. A word with low TF but high IDF (rare but distinctive) gets some weight. Only words with both high TF and high IDF get the highest scores.
This weighting scheme automatically identifies the most distinctive words in each document, exactly what we want for tasks like search, classification, and topic modeling.
Worked Example
Let's work through a concrete example with three short documents:
- Document 1: "the cat sat on the mat"
- Document 2: "the dog sat on the log"
- Document 3: "the cat and dog sat"
First, we build our vocabulary by collecting all unique words (sorted alphabetically): {"and", "cat", "dog", "log", "mat", "on", "sat", "the"}. That's 8 words, so our vectors will have 8 dimensions.
Bag of Words vectors:
- Document 1: [0, 1, 0, 0, 1, 1, 1, 2] (2×"the", 1×"cat", 1×"sat", 1×"on", 1×"mat")
- Document 2: [0, 0, 1, 1, 0, 1, 1, 2] (2×"the", 1×"sat", 1×"on", 1×"dog", 1×"log")
- Document 3: [1, 1, 1, 0, 0, 0, 1, 1] (1×"the", 1×"cat", 1×"sat", 1×"dog", 1×"and")
Notice that "the" appears in all three documents, "sat" appears in all three, but "mat", "log", and "and" appear in only one document each.
Calculating IDF (using the natural logarithm):
- "the": appears in 3/3 documents → $\text{idf} = \log(3/3) = 0$
- "sat": appears in 3/3 documents → $\text{idf} = \log(3/3) = 0$
- "cat": appears in 2/3 documents → $\text{idf} = \log(3/2) \approx 0.405$
- "dog": appears in 2/3 documents → $\text{idf} = \log(3/2) \approx 0.405$
- "mat": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
- "log": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
- "and": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
Calculating TF-IDF for Document 1:
Using raw counts for TF:
- "the": , →
- "cat": , →
- "mat": , →
The word "mat" gets the highest TF-IDF score in Document 1 because it's unique to that document. "The" gets zero weight because it appears everywhere. This is exactly the behavior we want: distinctive words are emphasized, common words are suppressed.
Code Implementation
Let's implement Bag of Words and TF-IDF from scratch. We'll build this step by step, focusing on understanding each component.
Step 1: Tokenization and Vocabulary Building
First, we need to split documents into words and build our vocabulary:
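A minimal sketch of the tokenization step; the helper name `tokenize` and the whitespace-splitting strategy are illustrative choices, not the only option:

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace into word tokens."""
    return text.lower().split()

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]

tokenized_docs = [tokenize(doc) for doc in documents]
print(tokenized_docs[0])
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```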
Now we build the vocabulary by collecting all unique words:
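One straightforward way to build the mapping (sorting the words so the index assignment is deterministic):

```python
def build_vocabulary(tokenized_docs):
    """Collect every unique word and assign it a fixed integer index."""
    unique_words = sorted({word for doc in tokenized_docs for word in doc})
    return {word: index for index, word in enumerate(unique_words)}

vocabulary = build_vocabulary(tokenized_docs)
print(vocabulary)
# {'and': 0, 'cat': 1, 'dog': 2, 'log': 3, 'mat': 4, 'on': 5, 'sat': 6, 'the': 7}
```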
Our vocabulary contains 8 unique words. The mapping assigns each word a unique integer index, which we'll use to create our vectors.
Step 2: Bag of Words Vectorization
Now we'll convert each document into a count vector:
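A simple sketch that reuses the vocabulary mapping from the previous step:

```python
def bag_of_words_vector(tokens, vocabulary):
    """Count how many times each vocabulary word appears in one document."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

bow_vectors = [bag_of_words_vector(doc, vocabulary) for doc in tokenized_docs]
for i, vector in enumerate(bow_vectors, start=1):
    print(f"Document {i}: {vector}")
# Document 1: [0, 1, 0, 0, 1, 1, 1, 2]
# Document 2: [0, 0, 1, 1, 0, 1, 1, 2]
# Document 3: [1, 1, 1, 0, 0, 0, 1, 1]
```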
Each vector shows word counts. Document 1 has 2 occurrences of "the" (index 7), 1 of "cat" (index 1), and so on. Notice how sparse these vectors are: most entries are zero.
Step 3: Calculating Term Frequency
Let's implement normalized term frequency:
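One possible sketch supporting the raw, length-normalized, and logarithmic variants discussed above (the `mode` argument is just an illustrative convenience):

```python
import math

def term_frequency(tokens, vocabulary, mode="normalized"):
    """Compute the term frequency of every vocabulary word in one document."""
    counts = bag_of_words_vector(tokens, vocabulary)
    if mode == "normalized":                       # fraction of the document
        return [count / len(tokens) for count in counts]
    if mode == "log":                              # dampened log scaling
        return [math.log(1 + count) for count in counts]
    return counts                                  # raw counts

tf_doc1 = term_frequency(tokenized_docs[0], vocabulary)
tf_doc3 = term_frequency(tokenized_docs[2], vocabulary)
print(round(tf_doc1[vocabulary["cat"]], 3))  # 0.167
print(round(tf_doc3[vocabulary["cat"]], 3))  # 0.2
```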
Normalized TF gives us proportions: "cat" makes up 16.7% of Document 1 and 20% of Document 3. Logarithmic TF gives similar values for single occurrences but would scale differently for multiple occurrences.
Step 4: Calculating Inverse Document Frequency
Now we'll compute IDF for each word:
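A sketch using the unsmoothed formula $\text{idf}(t) = \log(N / \text{df}(t))$ with the natural logarithm:

```python
def inverse_document_frequency(tokenized_docs, vocabulary):
    """Compute idf(t) = log(N / df(t)) for every word in the vocabulary."""
    n_docs = len(tokenized_docs)
    idf = [0.0] * len(vocabulary)
    for word, index in vocabulary.items():
        doc_count = sum(1 for doc in tokenized_docs if word in doc)
        idf[index] = math.log(n_docs / doc_count)
    return idf

idf = inverse_document_frequency(tokenized_docs, vocabulary)
print(round(idf[vocabulary["the"]], 3))  # 0.0   -- appears in every document
print(round(idf[vocabulary["mat"]], 3))  # 1.099 -- appears in only one document
```

Step 5: Calculating TF-IDF

With both components in place, multiplying them gives the final weights. Again a sketch built on the helpers above, using normalized TF:

```python
def tf_idf(tokens, vocabulary, idf):
    """Multiply each term's frequency by its inverse document frequency."""
    tf = term_frequency(tokens, vocabulary)
    return [tf_value * idf_value for tf_value, idf_value in zip(tf, idf)]

tfidf_vectors = [tf_idf(doc, vocabulary, idf) for doc in tokenized_docs]
print(round(tfidf_vectors[0][vocabulary["mat"]], 3))  # 0.183 -- highest weight in Document 1
print(round(tfidf_vectors[0][vocabulary["the"]], 3))  # 0.0   -- appears everywhere, so zero weight
```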
Perfect! The TF-IDF scores emphasize distinctive words. "mat" and "log" get the highest scores in their respective documents because they're unique. Common words like "the" and "sat" get zero weight. This is exactly what we want for distinguishing between documents.
Step 6: Document Similarity
One powerful application of TF-IDF vectors is measuring document similarity using cosine similarity:
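A small from-scratch sketch; note that the exact scores depend on which TF and IDF variants were used to build the vectors:

```python
def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors: dot product divided by the norms."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[2]))  # Documents 1 and 3
print(cosine_similarity(tfidf_vectors[1], tfidf_vectors[2]))  # Documents 2 and 3
print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[1]))  # Documents 1 and 2
```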
Documents 1 and 3 share "cat", while documents 2 and 3 share "dog". Both pairs have the same similarity (0.408) because each shares one distinctive word. Documents 1 and 2 are less similar (0.234) because most of the words they share, such as "the" and "sat", are common words with zero TF-IDF weight.
Using scikit-learn for Production
While implementing from scratch teaches the concepts, in practice you'll use libraries like scikit-learn:
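A minimal example with TfidfVectorizer using its default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]

# Fit on the corpus and transform it into a sparse TF-IDF matrix in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(tfidf_matrix.shape)               # (3, 8): 3 documents, 8 vocabulary words
print(vectorizer.get_feature_names_out())
print(cosine_similarity(tfidf_matrix))  # pairwise document similarities
```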
scikit-learn's TfidfVectorizer handles all the details: tokenization, vocabulary building, TF-IDF calculation, and sparse matrix storage. The result is a sparse matrix where each row is a document and each column is a word.
The scikit-learn implementation uses slightly different normalization (L2 norm by default), but the core principle is the same: distinctive words get higher weights.
Limitations & Impact
Limitations
Bag of Words and TF-IDF have several well-known limitations:
- Loss of word order: "The cat chased the dog" and "The dog chased the cat" produce identical vectors. This discards syntactic and semantic information that word order conveys.
- No semantic understanding: These methods treat words as independent symbols. They can't understand that "car" and "automobile" are synonyms, or that "bank" can mean a financial institution or a river edge.
- Vocabulary explosion: As document collections grow, vocabularies can become extremely large (hundreds of thousands of words), leading to high-dimensional, sparse vectors that are computationally expensive.
- Context insensitivity: The same word always gets the same representation, regardless of context. "Apple" in "Apple stock price" and "apple pie recipe" are treated identically.
- Fixed vocabulary: New words not seen during vocabulary building are ignored. This makes the system brittle when encountering domain-specific terminology or evolving language.
Despite these limitations, Bag of Words and TF-IDF remain valuable tools. They're fast, interpretable, and work well as baselines or feature extractors for downstream models.
Impact and Applications
These classical techniques have had enormous impact and continue to be used in production systems:
- Search engines: Early web search (including early Google) relied heavily on TF-IDF for ranking. Google's early ranking combined PageRank's link analysis with TF-IDF-style content analysis.
- Document classification: Email spam filters, news categorization, and sentiment analysis systems often use TF-IDF features with classifiers like Naive Bayes or Support Vector Machines.
- Information retrieval: Library systems, legal document search, and academic paper search engines use TF-IDF to match queries to relevant documents.
- Feature engineering: Even in the era of neural networks, TF-IDF vectors are often concatenated with learned embeddings as input features, combining classical and modern approaches.
- Baseline comparisons: New NLP methods are typically compared against TF-IDF baselines to demonstrate improvement.
- Interpretability: Unlike black-box neural models, TF-IDF scores are directly interpretable. You can see exactly which words contribute to a document's representation and why.
The simplicity and effectiveness of these methods make them excellent starting points for text analysis. They teach us fundamental concepts about text representation that carry forward to more advanced techniques.
Summary
Bag of Words and TF-IDF provide foundational methods for converting text into numerical representations that machine learning algorithms can process.
Key takeaways:
- Bag of Words represents documents as fixed-length vectors of word counts, discarding word order but enabling mathematical operations on text
- Term Frequency (TF) measures how often a word appears in a document, often normalized by document length
- Inverse Document Frequency (IDF) measures how distinctive a word is across a collection, downweighting common words
- TF-IDF combines both components, emphasizing words that are frequent in specific documents but rare overall
- These methods are fast, interpretable, and effective for many tasks, but lose word order and semantic relationships
When to use:
- Building search systems or information retrieval applications
- Creating baseline models for text classification
- Feature engineering for downstream machine learning models
- Situations where interpretability matters more than state-of-the-art performance
What's next:
While Bag of Words and TF-IDF solve the fundamental problem of text representation, they're just the beginning. In the next chapters, we'll explore word embeddings that capture semantic relationships, sequence models that preserve word order, and transformer architectures that understand context. Each builds on these classical foundations while addressing their limitations.