Preprocessing Text Data for Machine Learning
Machine learning models cannot work directly with text data; you need to encode your text in some numeric form.
Any text document is essentially a sequence of words, which you can tokenize into individual words. After transforming your document into a list of words, you can encode and represent each word numerically using some kind of numeric encoding.
Once you have a numeric representation for each word in your document, you aggregate the data into a tensor. The question, then, is how to transform individual words into numeric form.
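As a rough sketch of those two steps, the snippet below (plain Python, with a toy sentence chosen for illustration) tokenizes a document and maps each word to an integer index:

```python
import re

# Toy document for illustration.
document = "The quick brown fox jumps over the lazy dog"

# Tokenize: split on non-word characters and lowercase each token.
tokens = re.findall(r"\w+", document.lower())

# Build a vocabulary that assigns every distinct word a numeric index.
vocabulary = {word: index for index, word in enumerate(sorted(set(tokens)))}

# Encode the document as a sequence of integers.
encoded = [vocabulary[word] for word in tokens]

print(tokens)   # ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(encoded)  # [7, 6, 0, 2, 3, 5, 7, 4, 1]
```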
There are different techniques you can use to encode text as numbers:
One-hot encoding:
Represents each word in the text by its presence or absence. The size of the feature vector used to represent a word is equal to the size of your vocabulary. Each word has a corresponding position in that feature vector: a 1 indicates the word is present, a 0 that it is absent. One-hot encoding has several flaws. In particular, you have no idea how often a particular word occurred in the text; you only know whether it was present or absent. An improvement over one-hot encoding is frequency-based encoding.
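A minimal sketch of this idea, assuming a toy vocabulary (the words and helper name are made up for illustration):

```python
import numpy as np

# Toy vocabulary mapping each word to its position in the feature vector.
vocabulary = {"cat": 0, "dog": 1, "fish": 2, "bird": 3}

def one_hot(word, vocabulary):
    # Feature vector size equals vocabulary size; 1 marks the word's position.
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[vocabulary[word]] = 1
    return vector

print(one_hot("dog", vocabulary))  # [0 1 0 0]
```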
Frequency-based numeric representation:
Frequency-based numeric encodings can be divided into three broad categories:
- Count-based encodings: use the numbers in a feature vector to represent a count of how often a word occurs in a document. These capture the frequency of a word in a particular document, which matters because frequency may indicate a word's importance in that document (see the first sketch after this list).
- TF-IDF: An improvement over count-based feature vectors is feature vectors built from TF-IDF scores. TF stands for term frequency and IDF for inverse document frequency. TF-IDF scores try to capture how often a word occurs in a document as well as across the entire corpus. The TF score up-weights words that occur more frequently in a single document; if a word occurs often in one document, that word might be important there. The IDF score down-weights words that occur frequently across the corpus: stop words such as "a", "an", and "the" might occur frequently across the corpus, so IDF reduces their weight (see the second sketch below).
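As a hedged illustration of count-based encoding, the sketch below uses scikit-learn's CountVectorizer, one common implementation; the two toy sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# CountVectorizer builds the vocabulary and counts word occurrences per document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```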
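And a matching sketch for TF-IDF, again using scikit-learn as one possible implementation, on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# TfidfVectorizer combines term frequency with inverse document frequency,
# down-weighting words like "the" and "cat" that appear in every document.
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))
```

Compared with raw counts, words that appear in every document (like "the") get lower scores here, while words unique to one document stand out.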