What is stemming and lemmatization?
In text preprocessing, stemming and lemmatization are closely related techniques that fall under the umbrella of morphological word normalization. Morphology is the branch of linguistics that studies how words are formed from smaller units called morphemes. For example, cats contains two morphemes: cat and s.
Stemming
Stemming is the process of reducing a word to its stem. Porter's algorithm is the most widely used stemming algorithm for English. Because it works by rule-based suffix stripping, it sometimes cuts words off crudely, and the resulting stem may not be very accurate. Even so, it is useful in machine learning: multiple variations of a word are collapsed to a single stem, and that one stem can then represent all of the variations. This reduces the number of features, and with it training time and compute costs. For example, combine, combines, and combining are all transformed to combin. This stem is clearly not found in the dictionary, but that is fine for tasks that only need a single representation of several word variants rather than a real word.
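The combine example can be reproduced with NLTK's PorterStemmer. A minimal sketch (the word list is just an illustration):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# All three inflected forms collapse to the same non-dictionary stem "combin".
for word in ["combine", "combines", "combining"]:
    print(word, "->", stemmer.stem(word))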
Lemmatization
Lemmatization is very similar to stemming, but instead of cutting words off by rule, it maps each word to its dictionary form. It removes inflectional endings and leaves the base form of the word, called a lemma. For example, combine, combines, and combining are all transformed to combine. Because the output is always a real word, lemmatization is often the better choice for tasks that require an analysis of actual word forms.
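A minimal sketch of the same example with NLTK's WordNetLemmatizer. Note that WordNet's lemmatizer needs a part-of-speech hint (pos="v" for verbs here); without it, every word is treated as a noun and combining would come back unchanged. The word list is just an illustration.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With the verb POS tag, every form maps to the dictionary word "combine".
for word in ["combine", "combines", "combining"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))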
Performing stemming and lemmatization in NLTK
The following code uses NLTK in Python to perform stemming on a set of words.
import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the 'punkt' tokenizer models; if they are not installed,
# run nltk.download('punkt') once.
stemmer = PorterStemmer()
text = "play plays played playing"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print("Stemming {} to {}".format(word, stemmer.stem(word)))
Output:
Stemming play to play
Stemming plays to play
Stemming played to play
Stemming playing to play
Next, we perform lemmatization using WordNetLemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; if it is not installed,
# run nltk.download('wordnet') once.
lemmatizer = WordNetLemmatizer()
text = "play plays played playing player eating party parties"
tokens = nltk.word_tokenize(text)
for word in tokens:
    print("Lemmatizing {} to {}".format(word, lemmatizer.lemmatize(word)))
Output:
Lemmatizing play to play
Lemmatizing plays to play
Lemmatizing played to played
Lemmatizing playing to playing
Lemmatizing player to player
Lemmatizing eating to eating
Lemmatizing party to party
Lemmatizing parties to party
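Notice that played, playing, and eating come back unchanged: lemmatize treats every word as a noun unless a part-of-speech is supplied. Passing pos="v" handles the verb forms. A short sketch continuing the example above:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With the verb POS tag, the inflected verb forms reduce to their lemmas.
for word in ["played", "playing", "eating"]:
    print("Lemmatizing {} to {}".format(word, lemmatizer.lemmatize(word, pos="v")))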
That’s it!
Please let me know in the comments if I missed something! The aim is to learn together! 🙂