Lesson 23: Text Data & NLP Basics

Welcome to Natural Language Processing (NLP)!

The challenge with text is that machine learning models operate on numbers, not words. To apply machine learning to text, we must first convert the text into a numerical representation.

The Text Preprocessing Pipeline

Before converting text to numbers, we usually clean it up. A typical pipeline looks like this:

Lowercasing: Making everything lowercase so "Apple" and "apple" are treated the same.
Removing Punctuation: Stripping out commas, periods, etc.
Tokenization: Splitting a sentence into a list of individual words (tokens).
Removing Stop Words: Dropping common words like "the", "is", and "a" that don't add much meaning.
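
To make the four steps concrete, here is a minimal sketch that chains them together using only the Python standard library. The function name preprocess and the tiny STOP_WORDS set are made up for illustration; real projects usually rely on a much longer stop-word list from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "is", "a", "was", "and"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None (deletion).
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The plot is simple, and the acting is superb!")
print(tokens)  # ['plot', 'simple', 'acting', 'superb']
```

Note that splitting on whitespace is the simplest possible tokenizer. It is enough for this lesson, but it will not handle contractions or hyphenated words gracefully.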

Bag of Words (BoW) & TF-IDF

Once text is tokenized, we represent it using numbers. The simplest way is a Bag of Words, where we count how many times each word in the vocabulary appears in a document and ignore word order.
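
The counting idea can be sketched in a few lines with collections.Counter. The two toy documents and the helper name bag_of_words are assumptions made for this example, and the documents are assumed to be tokenized already.

```python
from collections import Counter

# Two toy documents, already tokenized (made-up data).
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
]

# The vocabulary is every unique word across all documents, in a fixed order.
vocab = sorted({tok for doc in docs for tok in doc})

def bag_of_words(tokens, vocab):
    """Return one count per vocabulary word; words absent from tokens count as 0."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)    # ['boring', 'cast', 'great', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each document becomes a list of numbers with the same length as the vocabulary, which is the shape a machine learning model expects.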

A smarter approach is TF-IDF (Term Frequency-Inverse Document Frequency). It gives more weight to words that appear frequently in a specific document but rarely across the rest of the collection. For example, in a collection of movie reviews, "cinematography" is much more informative than "movie", because "movie" appears in almost every review.
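
Here is a minimal sketch of one common textbook formulation: term frequency is the word's count divided by the document length, and inverse document frequency is log(N / number of documents containing the word). The toy reviews and the function names tf, idf, and tf_idf are assumptions for this example. Libraries often use smoothed or normalized variants, so their numbers will differ from these.

```python
import math

# Three toy reviews, already tokenized (made-up data).
docs = [
    ["the", "movie", "had", "stunning", "cinematography"],
    ["the", "movie", "was", "too", "long"],
    ["a", "fun", "movie"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency (unsmoothed).

    Assumes the term appears in at least one document; otherwise this
    would divide by zero.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("movie", docs[0], docs))           # 0.0, since "movie" is in every review
print(tf_idf("cinematography", docs[0], docs))  # roughly 0.22
```

Because "movie" appears in all three reviews, its IDF is log(3/3) = 0 and it contributes nothing, while "cinematography" appears in only one review and gets a positive score.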

Coding Challenge: Tokenization

Let's practice some basic string manipulation in Python to tokenize a sentence.

Create a variable review = "The movie was absolutely fantastic!".
Convert the review to lowercase using review.lower() and store it.
Remove the exclamation mark by using .replace("!", "").
Split the string into a list of words using .split() and print the list.
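
Try it yourself first. When you are ready to check your work, here is one possible solution that follows the four steps above. The variable names lowered, cleaned, and tokens are just one choice; any names will do.

```python
# Step 1: create the review string.
review = "The movie was absolutely fantastic!"

# Step 2: convert to lowercase and store the result.
lowered = review.lower()

# Step 3: remove the exclamation mark.
cleaned = lowered.replace("!", "")

# Step 4: split on whitespace into a list of words and print it.
tokens = cleaned.split()
print(tokens)  # ['the', 'movie', 'was', 'absolutely', 'fantastic']
```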