Welcome to Natural Language Processing (NLP)!
The challenge with text is that computers only understand numbers, not words. To apply Machine Learning to text, we must first convert the text into numbers.
Before converting text to numbers, we usually clean it up. A typical pipeline lowercases the text, removes punctuation, and then splits it into individual tokens (tokenization).
Once text is tokenized, we represent it using numbers. The simplest way is a Bag of Words: we count how many times each word occurs in a document and ignore word order.
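To make the counting concrete, here is a minimal sketch using Python's built-in collections.Counter. The helper name bag_of_words and the sample tokens are made up for illustration.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs (hypothetical helper)."""
    return Counter(tokens)

# Illustrative, already-tokenized review.
tokens = ["the", "movie", "was", "good", "the", "acting", "was", "great"]
counts = bag_of_words(tokens)

print(counts["the"])    # 2
print(counts["great"])  # 1
print(counts["plot"])   # 0, a Counter returns 0 for words it never saw
```

A Counter behaves like a dictionary, so each word maps directly to its count.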
A smarter approach is TF-IDF (Term Frequency-Inverse Document Frequency). It gives more weight to words that appear frequently in a specific document but rarely across all documents. For example, across a collection of movie reviews, "cinematography" is much more informative than "movie", which appears in almost every review.
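The weighting idea can be sketched in a few lines of plain Python. This is one simple variant, assuming tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed versions of these formulas. The function name tf_idf and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumption: tf = count / len(doc), idf = log(N / df), unsmoothed.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Illustrative corpus of three tokenized reviews.
docs = [
    ["the", "movie", "had", "stunning", "cinematography"],
    ["the", "movie", "was", "boring"],
    ["a", "fun", "movie"],
]
scores = tf_idf(docs)

print(scores[0]["movie"])           # 0.0, "movie" is in every document
print(scores[0]["cinematography"])  # positive, it appears in only one
```

Because "movie" occurs in all three documents, its idf is log(3/3) = 0 and its weight vanishes, while "cinematography" keeps a positive weight.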
Let's practice some basic string manipulation in Python to tokenize a sentence.
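Start with review = "The movie was absolutely fantastic!". Convert it to lowercase with review.lower() and store the result. Remove the exclamation mark with .replace("!", ""). Finally, call .split() to break the string into words, and print the resulting list.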
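The steps above can be sketched as follows. The intermediate variable names (lowered, cleaned, tokens) are chosen here for readability and are not prescribed by the exercise.

```python
review = "The movie was absolutely fantastic!"

lowered = review.lower()            # "the movie was absolutely fantastic!"
cleaned = lowered.replace("!", "")  # drop the exclamation mark
tokens = cleaned.split()            # split on whitespace

print(tokens)
# ['the', 'movie', 'was', 'absolutely', 'fantastic']
```

Note that replace("!", "") only handles this one punctuation mark. Real text would need a more general cleanup step.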