Preprocessing Text Data for Machine Learning

5 min read · Jan 5, 2021

Machine learning models cannot work directly with text data. You need to encode your text in some numeric form first.

Any text document is essentially a sequence of words, which you can tokenize into individual words. After transforming your document into a sequence or list of words, you can represent each word in numeric form using some kind of numeric encoding.
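As a rough illustration of the tokenize-then-encode step, here is a minimal sketch in plain Python. The helper names `tokenize` and `build_vocabulary` are made up for this example, and the regular expression is only one simple way to split text into words; real projects usually rely on a library tokenizer.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(tokens):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

document = "The cat sat on the mat"
tokens = tokenize(document)
vocab = build_vocabulary(tokens)

# Replace every word with its integer id.
encoded = [vocab[token] for token in tokens]

print(tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(encoded)  # [0, 1, 2, 3, 0, 4]
```

Integer ids like these are only the simplest possible numeric form. The encodings described next build richer feature vectors on top of the same vocabulary idea.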

Once you have a numeric representation for each word in your document, you aggregate the data into a tensor. The question now is how to transform individual words into numeric form.

There are different techniques you can use to encode text as numbers:

One-hot encoding :

One-hot encoding represents each word in the text by its presence or absence. The size of the feature vector used to represent a word is equal to the size of your vocabulary, and each word has a corresponding position in that vector: 1 indicates the word is present and 0 indicates it is absent. One-hot encoding has several flaws. In particular, you have no idea how often a word occurred in the text; you only know whether it was present or absent. Frequency-based encodings are an improvement over one-hot encoding.
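To make this concrete, here is a small sketch in plain Python. The function names `one_hot` and `presence_vector` and the tiny vocabulary are illustrative assumptions, not part of any library.

```python
def one_hot(word, vocab):
    """Return a vector of vocabulary size with a 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

def presence_vector(tokens, vocab):
    """Mark which vocabulary words are present in a document (1) or absent (0)."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

# Hypothetical five-word vocabulary: word -> position in the feature vector.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

print(one_hot("cat", vocab))                          # [0, 1, 0, 0, 0]
print(presence_vector(["the", "the", "cat"], vocab))  # [1, 1, 0, 0, 0]
```

Note that "the" appears twice in the second call but still contributes a single 1. That is exactly the information loss described above.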

Frequency-based numeric representation :

Frequency-based numeric encodings can be divided into three broad categories:

  • Count-based encodings : use the numbers in a feature vector to represent how often a word occurs in a document. This captures the frequency of a word in a particular document, which matters because frequency may indicate how important the word is in that document.
  • TF-IDF : Feature vectors built using TF-IDF scores are an improvement over count-based feature vectors. TF stands for term frequency and IDF stands for inverse document frequency. TF-IDF scores try to capture how often a word occurs in a document as well as across the entire corpus. The TF score up-weights words that occur more frequently in a document: if a word occurs frequently in a single document, that word might be important. The IDF score down-weights words that occur frequently across the corpus. Stop words such as “a”, “an”, and “the” might occur frequently across…
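The count-based and TF-IDF encodings from the list above can be sketched in plain Python as follows. This is only one possible formulation: it assumes TF is the word count divided by the document length and IDF is log(N / number of documents containing the word). Libraries typically use smoothed or normalized variants, so their scores will differ. The toy corpus and all function names are invented for this example.

```python
import math
from collections import Counter

# Toy corpus of three already-tokenized documents (illustrative only).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

# Fixed, sorted vocabulary so every document maps to a vector of the same size.
vocab = sorted({word for doc in corpus for word in doc})

def count_vector(tokens, vocab):
    """Count-based encoding: how often each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def inverse_document_frequency(corpus, vocab):
    """Assumed IDF variant: log(total documents / documents containing the word)."""
    n_docs = len(corpus)
    idf = {}
    for word in vocab:
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf[word] = math.log(n_docs / docs_with_word)
    return idf

def tfidf_vector(tokens, vocab, idf):
    """TF-IDF encoding: relative frequency in the document, scaled by IDF."""
    counts = Counter(tokens)
    return [(counts[word] / len(tokens)) * idf[word] for word in vocab]

idf = inverse_document_frequency(corpus, vocab)
counts_doc0 = count_vector(corpus[0], vocab)
tfidf_doc0 = tfidf_vector(corpus[0], vocab, idf)

for word, count, score in zip(vocab, counts_doc0, tfidf_doc0):
    print(f"{word:>6}  count={count}  tfidf={score:.3f}")
```

In the first document, "the" has the highest count but a TF-IDF score of zero under this formula, because it occurs in every document of the corpus. A word like "mat", which occurs in only one document, ends up with the highest score.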

Data engineer and machine learning enthusiast with a great interest in cloud technologies