Preprocessing Text Data for Machine Learning

Machine learning models cannot work directly with text data. You need to encode your text in some numeric form.
  • TF-IDF: An improvement over count-based feature vectors is the feature vector built from TF-IDF scores. TF stands for term frequency and IDF for inverse document frequency. TF-IDF scores capture how often a word occurs in a document as well as across the entire corpus. The TF component up-weighs words that occur more frequently in a single document; if a word occurs often in one document, it is probably important to that document. The IDF component down-weighs words that occur frequently across the corpus: stop words such as “a”, “an”, and “the” appear in almost every document, carry little meaning, and should be down-weighed. The TF-IDF score for a word combines these two components. Neither TF-IDF nor count vectors capture the context surrounding a word; co-occurrence matrices do.
  • Co-occurrence: Co-occurrence matrices are built on the principle that similar words occur together and share similar contexts. They are generated using a context window: a window centered on a word that includes a fixed number of neighboring words. The co-occurrence matrix records the number of times two words (word1 and word2) occur together within that context window.

Bag-of-words and Bag-of-n-grams Models:

There are two kinds of bag-based models: you can represent your text as a bag of words or as a bag of n-grams.

  • The bag-of-n-grams model: an extension of the bag-of-words model that represents a document as a multiset of its constituent n-grams. It disregards the order of the n-grams themselves but maintains their multiplicity, and it has an advantage over the traditional bag of words: each n-gram preserves some local word order, so the representation stores additional spatial information about a word. Machine learning models that work with bag-of-n-grams features may therefore perform better than those that work with a simpler bag-of-words model.

Vectorize Text Using the Bag-of-words Model:

The scikit-learn estimator we’ll use here is the CountVectorizer. The CountVectorizer estimator generates frequency-based (count) encodings for your text data.

from sklearn.feature_extraction.text import CountVectorizer

text = ["Hire yourself and start calling the shots.",
        "Don’t Let Yesterday Take Up Too Much Of Today.",
        "Hold the vision, trust the process.",
        "Whatever you are, be a good one.",
        "Impossible is just an opinion."]

count_vectorizer = CountVectorizer()  # create the CountVectorizer estimator
transformed_vector = count_vectorizer.fit_transform(text)  # learn the vocabulary and encode the text

test_text = ["This sentence is not in the train text."]
count_vectorizer.transform(test_text).toarray()  # encode unseen text with the learned vocabulary

