Preprocessing for Natural Language Processing
Raw data for natural language processing is received from a lot of sources, and it needs to be cleaned and preprocessed before any model is applied to it.
The following are some of the methods used in preprocessing:
Tokenization
Stop words removal
Stemming
Normalization
Lemmatization
Parts of speech tagging
Text Enrichment or Augmentation
Word Embedding or Text Vectors
Stop Words Removal
Stop word removal is one of the preprocessing techniques in NLP. Stop words are the most commonly occurring words in English or any other language, such as 'a', 'the', 'in', and 'on'. In text classification these words do not add much value, so they can be removed to reduce the number of tokens and improve the performance of the model.
There are a number of ways of removing stop words with different libraries.
Stop words removal with NLTK
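A minimal sketch using NLTK's built-in English stop word list; the sample sentence and variable names are illustrative, and the stopwords and punkt resources must be downloaded first:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example sentence showing stop word removal"
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stop word list
tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['example', 'sentence', 'showing', 'stop', 'word', 'removal']
```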
Stop words removal with Gensim
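Gensim offers a remove_stopwords utility that operates directly on strings; a minimal sketch, where the input is lowercased first because Gensim's stop word list is lowercase:

```python
from gensim.parsing.preprocessing import remove_stopwords

text = "This is an example sentence showing stop word removal"
# Stop words such as 'this', 'is', and 'an' are dropped
print(remove_stopwords(text.lower()))
```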
Stop words removal reduces the amount of data, increases the performance of the model, and reduces the time taken for training.
Stop words removal can be problematic when it affects the context. When sequence-to-sequence models are used, removing the stop words can change the meaning of a sentence. Stop word removal should therefore be avoided when building models for sentiment analysis, language translation, or question answering.
Stop word removal can be used in text classification models. It is mostly employed with classical NLP techniques such as Bag of Words and TF-IDF, and is not used with deep learning sequence-to-sequence models such as RNNs and LSTMs.
Stemming and Lemmatization
In English, many words are derived from a root word. For example, recovery, recovers, recover, recovered, and recovering are all derived from the root word 'recover'.
Stemming and lemmatization are processes that strip letters from a word to reduce it to its root. They reduce the inflectional forms of words, which decreases the number of tokens and so improves the performance of the model.
Stemming follows a crude heuristic in which the letters at the end of the word are chopped off, so the string that remains may not be an exact English word. For example, 'tried' becomes 'tri' after the letters 'ed' are removed, but 'trying' becomes 'try'.
Stemming may therefore produce an exact English word on some occasions.
Lemmatization uses vocabulary and the context of the word to produce a proper lemma, which is always a valid word.
Stemming is faster than lemmatization, as it only chops off letters, whereas lemmatization requires a lookup in a vocabulary to find a proper word.
Stemming with NLTK
Depending on the algorithm used, NLTK provides four types of stemmer: the Porter stemmer, the Lancaster stemmer, the Regexp stemmer, and the Snowball stemmer.
Lancaster Stemmer
This is a stemmer based on the Lancaster (Paice/Husk) stemming algorithm.
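A minimal sketch; the sample words are illustrative:

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
for word in ["recovery", "recovering", "recovered"]:
    print(word, "->", stemmer.stem(word))
```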
Regexp Stemmer
This stemmer uses regular expressions to chop off suffixes. It takes custom suffixes as input.
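A minimal sketch; the suffix pattern and minimum length below are illustrative choices:

```python
from nltk.stem import RegexpStemmer

# Chop the custom suffixes 'ing', 'ed', and 's'; min=4 protects
# very short words from being over-stemmed.
stemmer = RegexpStemmer('ing$|ed$|s$', min=4)
print(stemmer.stem('recovering'))  # recover
print(stemmer.stem('recovered'))   # recover
```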
Porter Stemmer
This is a stemmer based on the Porter stemming algorithm. You can select the mode for the stemmer: ORIGINAL_ALGORITHM, MARTIN_EXTENSIONS, or NLTK_EXTENSIONS. NLTK_EXTENSIONS is the default.
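A minimal sketch showing the mode selection; the sample words are illustrative:

```python
from nltk.stem import PorterStemmer

# NLTK_EXTENSIONS is the default mode; PorterStemmer.ORIGINAL_ALGORITHM
# and PorterStemmer.MARTIN_EXTENSIONS can be passed instead.
stemmer = PorterStemmer(mode=PorterStemmer.NLTK_EXTENSIONS)
print(stemmer.stem('tried'))   # tri
print(stemmer.stem('trying'))  # try
```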
Snowball Stemmer
This stemmer provides a port of the Snowball stemmers developed by Martin Porter. It includes stemmers for several different languages.
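A minimal sketch; the language name is passed to the constructor:

```python
from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)   # tuple of supported languages
stemmer = SnowballStemmer('english')
print(stemmer.stem('recovering'))
```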
Stemming with Gensim
Gensim only supports the Porter stemmer.
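A minimal sketch using Gensim's stem_text helper, which lowercases the text and applies the Porter algorithm; the sample sentence is illustrative:

```python
from gensim.parsing.preprocessing import stem_text

print(stem_text("He tried recovering the data"))
```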
Lemmatization with NLTK
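A minimal sketch using NLTK's WordNetLemmatizer; the wordnet resource must be downloaded first, and the sample words are illustrative:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# The optional pos argument tells the lemmatizer the part of speech
print(lemmatizer.lemmatize('recovered', pos='v'))  # recover
print(lemmatizer.lemmatize('feet'))                # foot
```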
Normalization
Normalization is another method of cleaning the text before a model is applied. It involves steps such as removing #tags, removing unnecessary prefixed or suffixed characters like dots, and converting the text to lower or upper case. When the data is normalized, the performance of stemming and lemmatization also improves. Some methods of normalizing data are described below.
Conversion of Text to Lower or Upper Case
Text data can arrive with mixed case. It is common to convert all text to lower case, which simplifies processing because the capitalization of words or letters no longer has to be handled. However, care should be taken with special cases such as 'US' and 'us': these two words have different meanings but become identical after lowercasing.
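A minimal sketch of case conversion, including the 'US' versus 'us' pitfall mentioned above; the sample sentence is illustrative:

```python
text = "The US team met us in Paris"
# After lowercasing, 'US' and 'us' become indistinguishable
print(text.lower())  # the us team met us in paris
```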
Transformation to Canonical Form
A lot of text data, particularly tweets, arrives in short forms such as '2morrow' for 'tomorrow', or as emoticons such as ':)' for a smile. These need to be converted to their normal forms. This method transforms the text into a standard form, mapping words that have the same meaning but different spellings to a single word. For example, 'goood' and 'gud' should be transformed to 'good', and 'cross-words', 'crossword', and 'cross words' to 'crosswords'. It also removes extra white space and transforms numbered words into numeric form.
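A minimal sketch using a small lookup table; the canonicalize function and its mappings are illustrative, and a real system would need a much larger dictionary:

```python
import re

# Hypothetical mapping of non-standard spellings to canonical forms
CANONICAL = {
    "2morrow": "tomorrow",
    "goood": "good",
    "gud": "good",
    "cross-words": "crosswords",
    "cross words": "crosswords",
}

def canonicalize(text):
    text = re.sub(r"\s+", " ", text.strip())  # remove extra white space
    for variant, standard in CANONICAL.items():
        text = text.replace(variant, standard)
    return text

print(canonicalize("see  u 2morrow, gud   luck"))
# see u tomorrow, good luck
```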
Text Noise Removal
There can be a lot of prefixed or suffixed data which needs to be removed. Noise refers to all text that does not contribute to the analysis, such as digits, punctuation, and special characters. It can interfere with text analysis, so removing it helps NLP models perform better and improves accuracy. Noise removal strips extra white space, HTML tags, and unknown characters.
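A minimal sketch of noise removal with regular expressions; the remove_noise function and its sample input are illustrative:

```python
import re

def remove_noise(text):
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # drop digits, punctuation, special characters
    text = re.sub(r"\s+", " ", text)          # collapse extra white space
    return text.strip()

print(remove_noise("<p>Hello!!! Call me @ 9 a.m. #urgent</p>"))
# Hello Call me a m urgent
```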
Correction of Spelling Mistakes
There can also be a lot of spelling mistakes in the data, which need to be corrected.
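One option, as a sketch, is the third-party TextBlob package, whose correct() method fixes simple mistakes; the library choice here is an assumption, not prescribed by this text:

```python
from textblob import TextBlob

# TextBlob's corrections are statistical and not perfect
print(TextBlob("I havv goood speling!").correct())
# I have good spelling!
```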
Handling of Numerical Text
The numbers in the text data need to be either removed or converted into words. For example, '3' can either be dropped or spelled out as 'three'. Regular expressions in Python can be used for this and for normalization in general, as the sketch below shows.
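A minimal sketch of both options; removing digits needs only the standard re module, while spelling numbers out is shown with the third-party num2words package, which is an assumed choice rather than one prescribed by this text:

```python
import re
from num2words import num2words  # assumed third-party package

text = "The meeting has 3 sessions over 2 days"

# Option 1: remove the numbers entirely
print(re.sub(r"\d+", "", text))

# Option 2: convert each number into its word form
print(re.sub(r"\d+", lambda m: num2words(int(m.group())), text))
# The meeting has three sessions over two days
```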
Text Enrichment or Augmentation
This method extends the data set with a few operations based on the type of data it contains. It enriches the data set with information that did not previously exist, providing more semantics to the text and hence improving predictive accuracy. The data can be augmented using external knowledge sources, sentence or word shuffling, word replacement, and adding new words based on context.
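A minimal sketch of one of these operations, word replacement, using WordNet synonyms; the synonym_augment function is illustrative and deliberately naive, since it ignores part of speech and word sense:

```python
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

def synonym_augment(tokens):
    """Replace each token with a WordNet synonym when one exists."""
    augmented = []
    for token in tokens:
        lemmas = [l.name() for s in wordnet.synsets(token) for l in s.lemmas()
                  if l.name().lower() != token.lower()]
        augmented.append(lemmas[0].replace('_', ' ') if lemmas else token)
    return augmented

# Produces a new variant of the sentence for the training set
print(synonym_augment(["the", "movie", "was", "good"]))
```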
Tokenization
This method splits paragraphs into sentences and sentences into words, where each word can be treated as a token and used for processing in NLP.
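A minimal sketch using NLTK's sentence and word tokenizers; the punkt resource must be downloaded first, and the sample paragraph is illustrative:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

paragraph = "NLP needs clean text. Tokenization splits it up."
sentences = sent_tokenize(paragraph)           # paragraph -> sentences
words = [word_tokenize(s) for s in sentences]  # sentences -> word tokens
print(sentences)
print(words)
```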
Word Embedding or Text Vectors
This is the modern way of representing words as vectors. It allows the high-dimensional word features in text to be represented as word vectors. The words are represented by vectors using methods that preserve the relationships between words: if the vectors are plotted in a coordinate system, words with similar meanings appear closer together.
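A minimal sketch training a tiny Word2Vec model with Gensim; the toy corpus is illustrative and far too small for meaningful vectors, and vector_size is the Gensim 4.x parameter name (older versions used size):

```python
from gensim.models import Word2Vec

# Toy corpus; a real model needs far more text
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(model.wv["king"][:5])                   # first 5 dimensions of the vector
print(model.wv.most_similar("king", topn=2))  # nearest words in vector space
```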
Authors
Srinivas Chakravarthy : srinivas.yeeda@gmail.com
Chandrasekhar B N : chandru4ni@gmail.com