Preprocessing Text Data for Machine Learning

AnisKHELOUFI
5 min read · Jan 5, 2021

Machine learning models cannot work directly with text data. You need to encode your text in some numeric form first.

https://pixabay.com/fr/users/gdj-1086657/

Any text document is essentially just a sequence of words, which you can tokenize into individual words. After transforming your document into a sequence or list of words, you can represent each word in numeric form using some kind of numeric encoding.
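To make the tokenize-then-encode idea concrete, here is a minimal sketch in plain Python. The helper names `tokenize` and `build_vocabulary`, the regular expression, and the sample sentence are assumptions made for illustration; real projects usually rely on a library tokenizer, and integer ids are only one of several possible numeric encodings.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

# Hypothetical sample document
document = "The cat sat on the mat"

tokens = tokenize(document)           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocab = build_vocabulary(tokens)      # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
encoded = [vocab[t] for t in tokens]  # [0, 1, 2, 3, 0, 4]

print(tokens)
print(encoded)
```

Both occurrences of "the" map to the same id, so the encoded sequence keeps the word order while replacing every word with a number.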

Once you have a numeric representation for each word in your document, you aggregate them into a tensor. The question now is how to transform individual words into numeric form.

There are different techniques you can use to encode text as numbers:

One-hot encoding:

One-hot encoding represents each word in the text by its presence or absence. The feature vector that represents a word has the same size as your vocabulary, and each word has a corresponding position in that vector: use 1 to indicate that the word is present and 0 to indicate that it is absent. One-hot encoding has several flaws. In particular, you have no idea how often a particular word occurred in the text; you only know whether it was present or absent. An improvement over one-hot encoding is frequency-based encoding.
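The following sketch illustrates one-hot encoding in plain Python, under some assumptions: the tiny vocabulary, the token list, and the helper names `one_hot` and `presence_vector` are all hypothetical and chosen only for this example. It shows both the per-word vectors and a document-level presence vector, which makes the flaw described above visible.

```python
def one_hot(word, vocab):
    """Return a vector of vocabulary size with a 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

def presence_vector(tokens, vocab):
    """Mark every vocabulary word that appears at least once in the tokens."""
    vector = [0] * len(vocab)
    for token in tokens:
        vector[vocab[token]] = 1
    return vector

# Hypothetical vocabulary and tokenized documents
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# One row per word in the document, one column per vocabulary entry
word_vectors = [one_hot(t, vocab) for t in tokens]
print(word_vectors[1])  # vector for "cat": [0, 1, 0, 0, 0]

# "the" occurs twice, but the presence vector still stores only a 1 for it
print(presence_vector(tokens, vocab))          # [1, 1, 1, 1, 1]
print(presence_vector(["cat", "sat"], vocab))  # [0, 1, 1, 0, 0]
```

The list of per-word vectors forms a matrix with one row per token, which is the kind of structure you would then pack into a tensor. Note that the presence vector would look identical if "the" appeared once or a hundred times; that is exactly the information a frequency-based encoding adds back.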


AnisKHELOUFI

Data engineer and machine learning enthusiast with a great interest in cloud technologies