Machine learning models cannot work directly with text data; you need to encode your text in some numeric form.
Any text document is essentially just a sequence of words that you can tokenize into individual words. After transforming your document into a sequence or list of words, you can encode and represent each word using some kind of numeric encoding.
Once you have a numeric representation for each word in your document, you aggregate your data into a tensor. The question now is: how do you transform individual words into numeric form?
There are different techniques you can use to encode text as numbers:
One-hot encoding:
One-hot encoding represents each word in the text by its presence or absence. The size of the feature vector used to represent a word is equal to the size of your vocabulary; each word has a corresponding position in that feature vector, with a 1 indicating the word is present and a 0 indicating it is absent. One-hot encoding has several flaws: you have no idea how often a particular word occurred in the text, only whether it was present or absent. An improvement over one-hot encoding is frequency-based encodings.
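As a minimal sketch of the idea, using a small hypothetical vocabulary and a purely illustrative helper function:

vocabulary = ["impossible", "opinion", "vision", "process"]

def one_hot(word):
    # 1 at the word's position in the vocabulary, 0 everywhere else
    return [1 if vocab_word == word else 0 for vocab_word in vocabulary]

print(one_hot("vision"))  # [0, 0, 1, 0]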
Frequency-based numeric representation:
Frequency-based numeric encodings can be divided into three broad categories:
- Count-based encodings: use the numbers in a feature vector to represent a count of how often a word occurs in a document. These capture the frequency of a word in a particular document, which matters because the frequency may indicate how important a word is in that document.
- TF-IDF: An improvement over count-based feature vectors is feature vectors built using TF-IDF scores. TF stands for term frequency and IDF stands for inverse document frequency. TF-IDF scores try to capture how often a word occurs in a document, as well as across the entire corpus. The TF score up-weights words that occur more frequently in a single document; if a word occurs more often in one document, that word might be important. The IDF score down-weights words that occur frequently across the corpus. Stop words such as "a", "an", and "the" tend to occur frequently across the corpus; these words are not significant and should be down-weighted. The TF-IDF score for a word combines these two components. Neither TF-IDF vectors nor count vectors capture the context surrounding a word; co-occurrence matrices do (see the sketches after this list).
- Co-occurrence: Co-occurrence matrices are generated on the principle that similar words occur together and have similar contexts. Co-occurrence matrices for word representations are built using a context window: a window centered on a word that includes a certain number of neighboring words. The co-occurrence matrix captures the number of times two words (word1 and word2) have occurred together within that context window.
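As a minimal sketch of TF-IDF vectorization, scikit-learn's TfidfVectorizer combines the TF and IDF components for you; the toy corpus below is purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat on the mat", "the dog sat on the log"]
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(toy_corpus)  # one TF-IDF feature vector per document
print(tfidf_vectors.toarray())

And a rough sketch of counting co-occurrences within a context window; the window size of 2 neighbors on each side is an arbitrary choice for illustration:

from collections import defaultdict

window_size = 2
co_occurrence = defaultdict(int)
for document in toy_corpus:
    tokens = document.split()
    for i, word in enumerate(tokens):
        # count every neighbor inside the context window centered on this word
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if j != i:
                co_occurrence[(word, tokens[j])] += 1
print(co_occurrence[("sat", "the")])  # how often "sat" and "the" fall in the same window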
Prediction-based embeddings:
An advanced technique for word vector representations is prediction-based embeddings. These are numerical representations of text that capture the meaning of the text, as well as semantic relationships, and they are generated using ML models such as neural networks. Word vector embeddings are low-dimensionality representations of words that capture meaning. Word embeddings are able to capture relationships such as analogies: the relationship between king and queen is the same as that between man and woman, and the relationship between Paris and France is similar to the relationship between London and England. Not only do word embeddings capture meaning, they also offer dramatic dimensionality reduction: the feature vectors used to represent words have very low dimensionality and work very well in natural language processing applications.
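As a rough sketch of how such embeddings are trained, here is gensim's Word2Vec; gensim is not discussed above, it is only one of several libraries that can do this, and a real model needs far more training data than this toy corpus:

from gensim.models import Word2Vec

# a tiny, hypothetical tokenized corpus
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]
model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1)
print(model.wv["king"])                      # a 50-dimensional embedding for "king"
print(model.wv.similarity("king", "queen"))  # cosine similarity between the two embeddings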
Bag-of-words and bag-of-n-grams models:
There are two kinds of bag-based models: you can have a bag of words, or you can have your text represented as a bag of n-grams.
Bag-based models refer to how we represent a document made up of individual words, where those words are encoded in numeric form using any of the techniques above.
- A bag-of-words model: represents the document as a multiset of its constituent words. A bag representation disregards the order of words but maintains multiplicity; it keeps track of how often a word occurs in a document. Count vectorization (count-based feature vectors) and TF-IDF vectorization are both bag-of-words models. The order of the original words in a document is lost, but the counts or frequencies of the words are preserved in these encodings.
- The bag-of-n-grams model: is an extension of the bag-of-words model, except that it represents the document as a multiset of its constituent n-grams, disregarding the order of the n-grams but maintaining multiplicity. The bag-of-n-grams representation has a few advantages over the traditional bag of words: an n-gram can store additional spatial information about a word, so machine learning models that work with a bag of n-grams may give better performance than models that work with the simpler bag-of-words representation (see the sketch after this list).
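As a minimal sketch, scikit-learn's CountVectorizer can produce n-gram features through its ngram_range parameter; the single toy sentence here is purely illustrative, and get_feature_names_out is the name used in recent scikit-learn releases:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps individual words as well as pairs of adjacent words (bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectorizer.fit(["hold the vision trust the process"])
print(ngram_vectorizer.get_feature_names_out())  # unigrams plus bigrams such as "hold the" and "the vision"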
Vectorize Text Using the Bag-of-words Model:
The scikit-learn estimator that we'll use here is the CountVectorizer. The CountVectorizer estimator object in scikit-learn generates frequency-based encodings for your text data.
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
The text data we will work with:
text = ["Hire yourself and start calling the shots..",
"Don’t Let Yesterday Take Up Too Much Of Today. ",
"Hold the vision, trust the process",
"Whatever you are, be a good one.",
"Impossible is just an opinion."]
Instantiate the CountVectorizer estimator object. We use this CountVectorizer to tokenize the input text into words and convert those words to numeric representations.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(text)  # fit the CountVectorizer on the text so it learns the vocabulary
The get_feature_names function on your CountVectorizer estimator object (renamed get_feature_names_out in recent scikit-learn releases) gives you access to the vocabulary of your training text: all of the individual words that make up the vocabulary of your training corpus. The length of your vocabulary determines the size of the feature vectors generated by the CountVectorizer to represent your input text.
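For example, using the newer name (substitute get_feature_names on older scikit-learn versions):

print(count_vectorizer.get_feature_names_out())  # every word in the learned vocabulary, one per feature column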
The vocabulary_ attribute on the count_vectorizer gives you a mapping from each word to the numeric ID representing that word.
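For example:

print(count_vectorizer.vocabulary_)  # a dict mapping each word to its column index in the feature vector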
The transform function will create feature vectors from our training text, and we'll store these feature vectors in the transformed_vector variable.
transformed_vector = count_vectorizer.transform(text)
The feature vectors here are in a sparse matrix format; we convert them to a dense array using toarray. A count of 1 indicates that a particular word occurs just once, and a count of 2 indicates that the word occurs twice in that sentence.
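For example:

print(transformed_vector.toarray())  # a dense matrix: one row per sentence, one column per vocabulary word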
The count_vectorizer has an inverse_transform operation that allows you to regenerate the words of the original sentence from its feature vector; however, when you apply inverse_transform you'll find that the order of the words in the original sentence is lost.
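For example:

print(count_vectorizer.inverse_transform(transformed_vector))  # the words present in each sentence, without their original order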
Finally, we use this count_vectorizer to transform text that contains words not present in the count_vectorizer's vocabulary, by invoking the transform function on this test_text. Words the vectorizer has never seen are simply ignored, so only in-vocabulary words contribute counts.
test_text = ["This sentence is not in the train text."]count_vectorizer.transform(test_text).toarray()