Machine learning models cannot work directly with text data; you need to encode your text in some numeric form.
Any text document is essentially just a sequence of words that you can tokenize into individual words. After transforming your document into a sequence or list of words, you can encode and represent each word using some kind of numeric encoding.
Once you have a numeric representation for each word in your document, you aggregate your data into a tensor. The question now is: how do you transform individual words into numeric form?
There are different techniques you can use to encode text as numbers:
One-hot encoding:
One-hot encoding represents each word in the text by its presence or absence. The size of the feature vector used to represent a word is equal to the size of your vocabulary; each word has a corresponding position in that feature vector, with a 1 indicating the word is present and a 0 indicating it is absent. One-hot encoding has several flaws: you have no idea how often a particular word occurred in the text, only whether it was present or absent. An improvement over one-hot encoding is frequency-based encodings.
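As a minimal sketch of the idea, using a small hypothetical vocabulary and a purely illustrative helper function:

vocabulary = ["impossible", "opinion", "vision", "process"]

def one_hot(word):
    # 1 at the word's position in the vocabulary, 0 everywhere else
    return [1 if vocab_word == word else 0 for vocab_word in vocabulary]

print(one_hot("vision"))  # [0, 0, 1, 0]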
Frequency-based numeric representation:
Frequency-based numeric encodings can be divided into three broad categories:
- Count-based encodings: use the numbers in a feature vector to represent a count of how often a word occurs in a document. These capture the frequency of a word in a particular document, which matters because the frequency may indicate how important a word is in that document.
- TF-IDF: An improvement over count-based feature vectors is feature vectors built using TF-IDF scores. TF stands for term frequency and IDF stands for inverse document frequency. TF-IDF scores try to capture how often a word occurs in a document, as well as across the entire corpus. The TF score up-weights words that occur more frequently in a single document; if a word occurs more often in one document, that word might be important. The IDF score down-weights words that occur frequently across the corpus. Stop words such as "a", "an", and "the" tend to occur frequently across the corpus; these words are not significant and should be down-weighted. The TF-IDF score for a word combines these two components. Neither TF-IDF vectors nor count vectors capture the context surrounding a word; co-occurrence matrices do (see the sketches after this list).
- Co-occurrence: Co-occurrence matrices are generated on the principle that similar words occur together and have similar contexts. Co-occurrence matrices for word representations are built using a context window: a window centered on a word that includes a certain number of neighboring words. The co-occurrence matrix captures the number of times two words (word1 and word2) have occurred together within that context window.
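As a minimal sketch of TF-IDF vectorization, scikit-learn's TfidfVectorizer combines the TF and IDF components for you; the toy corpus below is purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat on the mat", "the dog sat on the log"]
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(toy_corpus)  # one TF-IDF feature vector per document
print(tfidf_vectors.toarray())

And a rough sketch of counting co-occurrences within a context window; the window size of 2 neighbors on each side is an arbitrary choice for illustration:

from collections import defaultdict

window_size = 2
co_occurrence = defaultdict(int)
for document in toy_corpus:
    tokens = document.split()
    for i, word in enumerate(tokens):
        # count every neighbor inside the context window centered on this word
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if j != i:
                co_occurrence[(word, tokens[j])] += 1
print(co_occurrence[("sat", "the")])  # how often "sat" and "the" fall in the same window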
Prediction-based embeddings:
An advanced technique for word vector representations is prediction-based embeddings. These are numerical representations of text that capture the meaning of the text, as well as semantic relationships, and they are generated using ML models such as neural networks. Word vector embeddings are low-dimensionality representations of words that capture meaning. Word embeddings are able to capture relationships such as analogies: the relationship between king and queen is the same as that between man and woman, and the relationship between Paris and France is similar to the relationship between London and England. Not only do word embeddings capture meaning, they also offer dramatic dimensionality reduction: the feature vectors used to represent words have very low dimensionality and work very well in natural language processing applications.
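As a rough sketch of how such embeddings are trained, here is gensim's Word2Vec; gensim is not discussed above, it is only one of several libraries that can do this, and a real model needs far more training data than this toy corpus:

from gensim.models import Word2Vec

# a tiny, hypothetical tokenized corpus
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]
model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1)
print(model.wv["king"])                      # a 50-dimensional embedding for "king"
print(model.wv.similarity("king", "queen"))  # cosine similarity between the two embeddings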
Bag-of-words and bag-of-n-grams models:
There are two kinds of bag-based models: you can have a bag of words, or you can have your text represented as a bag of n-grams.
Bag-based models refer to how we represent a document made up of individual words, where those words are encoded in numeric form using any of the techniques above.
- A bag-of-words model: represents the document as a multiset of its constituent words. A bag representation disregards the order of words but maintains multiplicity; it keeps track of how often a word occurs in a document. Count vectorization (count-based feature vectors) and TF-IDF vectorization are both bag-of-words models. The order of the original words in a document is lost, but the counts or frequencies of the words are preserved in these encodings.
- The bag-of-n-grams model: is an extension of the bag-of-words model, except that it represents the document as a multiset of its constituent n-grams, disregarding the order of the n-grams but maintaining multiplicity. The bag-of-n-grams representation has a few advantages over the traditional bag of words: an n-gram can store additional spatial information about a word, so machine learning models that work with a bag of n-grams may give better performance than models that work with the simpler bag-of-words representation (see the sketch after this list).
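As a minimal sketch, scikit-learn's CountVectorizer can produce n-gram features through its ngram_range parameter; the single toy sentence here is purely illustrative, and get_feature_names_out is the name used in recent scikit-learn releases:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps individual words as well as pairs of adjacent words (bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectorizer.fit(["hold the vision trust the process"])
print(ngram_vectorizer.get_feature_names_out())  # unigrams plus bigrams such as "hold the" and "the vision"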
Vectorize Text Using the Bag-of-words Model:
The scikit-learn estimator that we'll use here is the CountVectorizer. The CountVectorizer estimator object in scikit-learn generates frequency-based encodings for your text data.
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
The text data we will work with:
text = ["Hire yourself and start calling the shots..",
"Don’t Let Yesterday Take Up Too Much Of Today. ",
"Hold the vision, trust the process",
"Whatever you are, be a good one.",
"Impossible is just an opinion."]
Instantiate the CountVectorizer estimator object. We use this CountVectorizer to tokenize the input text into words and convert those words to numeric representations.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(text)  # fit the CountVectorizer on the text so it learns the vocabulary
The get_feature_names function on your CountVectorizer estimator object (renamed get_feature_names_out in recent scikit-learn releases) gives you access to the vocabulary of your training text: all of the individual words that make up the vocabulary of your training corpus. The length of your vocabulary determines the size of the feature vectors generated by the CountVectorizer to represent your input text.
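For example, using the newer name (substitute get_feature_names on older scikit-learn versions):

print(count_vectorizer.get_feature_names_out())  # every word in the learned vocabulary, one per feature column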
The vocabulary_ attribute on the count_vectorizer gives you a mapping from each word to the numeric ID representing that word.
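For example:

print(count_vectorizer.vocabulary_)  # a dict mapping each word to its column index in the feature vector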
The transform function will create feature vectors from our training text, and we'll store these feature vectors in the transformed_vector variable.
transformed_vector = count_vectorizer.transform(text)
The feature vectors here are in a sparse matrix format; we convert them to a dense array using toarray. A count of 1 indicates that a particular word occurs just once, and a count of 2 indicates that the word occurs twice in that sentence.
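For example:

print(transformed_vector.toarray())  # a dense matrix: one row per sentence, one column per vocabulary word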
The count_vectorizer has an inverse_transform operation that allows you to regenerate the words of the original sentence from its feature vector; however, when you apply inverse_transform you'll find that the order of the words in the original sentence is lost.
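For example:

print(count_vectorizer.inverse_transform(transformed_vector))  # the words present in each sentence, without their original order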
Finally, we use this count_vectorizer to transform text that contains words not present in the count_vectorizer's vocabulary, by invoking the transform function on this test_text. Words the vectorizer has never seen are simply ignored, so only in-vocabulary words contribute counts.
test_text = ["This sentence is not in the train text."]count_vectorizer.transform(test_text).toarray()