How to Perform Stemming and Lemmatization in NLP

What is stemming and lemmatization?

In text preprocessing, stemming and lemmatization are closely related techniques; both fall under the umbrella of morphological word normalization. Morphology, a branch of linguistics, studies how words are formed from smaller units called morphemes. For example, cats contains two morphemes: cat and -s.



Stemming is the process of reducing a word to its stem by chopping off affixes. Porter's algorithm is the most widely used stemming algorithm for English. Because it relies on heuristic rules, it sometimes cuts words arbitrarily and can produce stems that are not real words. Even so, it is useful in machine learning: by mapping multiple variations of a word to a single stem, it reduces the number of features, which saves time and CPU. For example, combine, combines, and combining are all reduced to combin. This stem is clearly not found in the dictionary, but that is fine for tasks that only need one representation for a group of related word forms, which does not necessarily have to be a real word.
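As a quick check of the combin example above, the snippet below runs NLTK's PorterStemmer over the three variations (assuming NLTK is installed):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# All three variations collapse to the same (non-dictionary) stem.
for word in ["combine", "combines", "combining"]:
    print(stemmer.stem(word))  # prints "combin" each time
```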



Lemmatization is very similar to stemming, but instead of cutting words arbitrarily, it removes inflectional endings and returns the base or dictionary form of the word, called a lemma. For example, combine, combines, and combining are all mapped to combine. This makes lemmatization the better option for tasks that require an analysis of real word forms.


Performing stemming and lemmatization in NLTK

The following code uses NLTK in Python to perform stemming on a set of words.

import nltk
from nltk.stem.porter import PorterStemmer

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer data

stemmer = PorterStemmer()
text = "play plays played playing"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print("Stemming {} to {}".format(word, stemmer.stem(word)))



Stemming play to play
Stemming plays to play
Stemming played to play
Stemming playing to play


Next, we perform lemmatization using WordNetLemmatizer.

import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # uncomment on first run to fetch the WordNet data

lemmatizer = WordNetLemmatizer()
text = "play plays played playing player eating party parties"
tokens = nltk.word_tokenize(text)
for word in tokens:
    print("Lemmatizing {} to {}".format(word, lemmatizer.lemmatize(word)))



Lemmatizing play to play
Lemmatizing plays to play
Lemmatizing played to played
Lemmatizing playing to playing
Lemmatizing player to player
Lemmatizing eating to eating
Lemmatizing party to party
Lemmatizing parties to party


That’s it!

Please inform me if I missed something in the comments! The aim is to learn together! 🙂



Mohammad D.

Mohammad D. works with sentiment analysis, NLP, and Python. He loves to blog about these and other related topics in his free time.
