How to Perform Part of Speech Tagging in NLP

What is POS Tagging in NLP?

Part of Speech (POS) tagging takes a sequence of words and labels each one with a part of speech based on its context in the text. It is also called word class disambiguation or grammatical tagging. It is especially useful alongside models such as bag of words, which do not take the surrounding context of a word into consideration. POS tags restore at least some information about the lexical category of each word, and this can serve as a useful feature for some machine learning classifiers. For example, in sentiment analysis, POS tags help to distinguish the adjective good (as in "well"), which has a positive sentiment, from the noun good (as in "product"), which has a neutral sentiment. POS tagging is also useful for Named Entity Recognition, since named entities are usually proper nouns and the tags help to pick them out.
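To make the "good" example concrete, here is a minimal sketch of a bag of words with and without POS information. The two toy sentences and their tags are hand-written assumptions for illustration, not the output of a real tagger.

```python
from collections import Counter

# Hand-tagged toy sentences (assumed Penn Treebank style tags).
tagged_review = [("the", "DT"), ("food", "NN"), ("was", "VBD"), ("good", "JJ")]
tagged_listing = [("each", "DT"), ("good", "NN"), ("is", "VBZ"), ("taxed", "VBN")]

def bag_of_words(tagged):
    """Plain bag of words: the tags are ignored."""
    return Counter(word for word, _ in tagged)

def bag_of_tagged_words(tagged):
    """Bag of words where each feature is a word/TAG pair."""
    return Counter(f"{word}/{tag}" for word, tag in tagged)

# Without tags, both sentences contribute the same feature "good".
print(bag_of_words(tagged_review)["good"], bag_of_words(tagged_listing)["good"])

# With tags, the adjective and the noun become separate features.
print(sorted(bag_of_tagged_words(tagged_review)))
print(sorted(bag_of_tagged_words(tagged_listing)))
```

A classifier trained on the tagged features can then learn a positive weight for good/JJ without being confused by good/NN.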

 

Techniques for POS Tagging

Lexical techniques

Given some tagged training samples, this technique simply assigns each word the POS tag that it occurs with most frequently in the training data.
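A minimal sketch of this idea in plain Python could look like the following. The training sample is tiny and hand-made, and the function names and the "NN" fallback for unknown words are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Tiny hand-made training sample (an assumption for illustration only).
training_data = [
    ("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ"),
    ("they", "PRP"), ("run", "VBP"), ("daily", "RB"),
    ("we", "PRP"), ("run", "VBP"), ("the", "DT"), ("race", "NN"),
]

def train_most_frequent_tag(pairs):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word from the lookup table; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_most_frequent_tag(training_data)
print(tag_words(["They", "run", "the", "marathon"], model))
```

Here "run" was seen twice as a verb and once as a noun, so it is always tagged VBP, even in "the run". That is the main weakness of a purely lexical tagger: it ignores context.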

 

Rule based techniques

This is useful when no training data is available. The tagger follows a set of predefined rules to tag words. For example, a word that is preceded by a determiner is probably a noun, a word that ends with ed is probably a verb, a word that ends with ly is probably an adverb, and so on. Brill’s tagger is an early POS tagger that uses a rule-based technique, although it learns its rules from tagged training data instead of relying only on hand-written ones.
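The three example rules can be sketched in a few lines of plain Python. The determiner list, the rule order, and the "NN" default are assumptions chosen for the example, not a complete rule set.

```python
import re

# A small, assumed list of determiners.
DETERMINERS = {"a", "an", "the", "this", "that"}

# Ordered (pattern, tag) suffix rules; the first match wins.
SUFFIX_RULES = [
    (re.compile(r".+ly$"), "RB"),   # ends with "ly" -> adverb
    (re.compile(r".+ed$"), "VBD"),  # ends with "ed" -> past-tense verb
]

def rule_based_tag(words, default="NN"):
    """Tag words using hand-written rules only, with no training data."""
    tagged = []
    for i, word in enumerate(words):
        lower = word.lower()
        if lower in DETERMINERS:
            tag = "DT"
        elif i > 0 and words[i - 1].lower() in DETERMINERS:
            tag = "NN"  # a word right after a determiner is assumed to be a noun
        else:
            tag = next((t for p, t in SUFFIX_RULES if p.match(lower)), default)
        tagged.append((word, tag))
    return tagged

print(rule_based_tag(["The", "dog", "barked", "loudly"]))
```

Rules like these are heuristics: "the red car" would tag "red" as a noun, which is why real rule-based taggers need many more rules and exceptions.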

 

Probabilistic/stochastic techniques

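This tags words based on probabilities estimated from training data: how likely a tag is to follow the previous tag, and how likely a word is to occur with a given tag. The tagger then picks the most probable sequence of tags for the whole sentence. Hidden Markov models are a classic choice for this technique.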
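As a rough sketch, a Hidden Markov model tagger can be decoded with the Viterbi algorithm. All probabilities below are hand-picked assumptions for a three-tag toy model; a real tagger would estimate them from a tagged corpus.

```python
import math

TAGS = ["DT", "NN", "VB"]

# Hand-picked probabilities (assumptions, not estimated from a corpus).
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VB": {"walk": 0.7, "dog": 0.1, "park": 0.2},
}
UNSEEN = 1e-6  # small floor so an unseen word does not zero out a path

def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], UNSEEN)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            emit = math.log(emit_p[t].get(word, UNSEEN))
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][t])) for p in TAGS),
                key=lambda item: item[1],
            )
            column[t] = (score + emit, prev)
        best.append(column)
    # Backtrack from the best final tag to recover the whole path.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return list(zip(words, path))

print(viterbi(["the", "dog", "walk"]))
print(viterbi(["the", "walk"]))
```

With these numbers, "walk" is tagged VB after "the dog" but NN directly after "the", so the same word gets a different tag depending on the sequence around it.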

 

Neural network and deep learning techniques

This technique is the most advanced, and uses neural networks such as recurrent neural networks (RNNs) to tag words.

 

POS Tagging with NLTK

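This can be easily done using NLTK, as shown in the code below. The tokenizer and tagger models have to be downloaded once before the first run.

import nltk
from nltk.tokenize import word_tokenize

# One-time model downloads. Assumption: these are the classic resource names;
# newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Split the text into tokens, then tag each token with its part of speech.
text = word_tokenize("""Whether you're just new a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!""")
print(nltk.pos_tag(text))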

 

Output:

[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('just', 'RB'), ('new', 'JJ'), ('a', 'DT'), ('new', 'JJ'), ('programmer', 'NN'), ('who', 'WP'), ('is', 'VBZ'), ('just', 'RB'), ('learning', 'VBG'), ('programming', 'VBG'), (',', ','), ('or', 'CC'), ('even', 'RB'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('very', 'RB'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('code', 'VB'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('an', 'DT'), ('awesome', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('text', 'NN'), ('mining', 'NN'), ('too', 'RB'), ('!', '.')]
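Once the text is tagged, the (word, tag) pairs can be filtered by tag, for example to keep only adjectives for sentiment analysis or nouns as entity candidates. The sketch below uses a slice of the output above, copied by hand so that it runs without NLTK; the helper name is an assumption.

```python
# A slice of the tagged output shown above, copied by hand.
tagged = [('It', 'PRP'), ("'s", 'VBZ'), ('an', 'DT'), ('awesome', 'JJ'),
          ('tool', 'NN'), ('for', 'IN'), ('text', 'NN'), ('mining', 'NN'),
          ('too', 'RB'), ('!', '.')]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

adjectives = words_with_tag_prefix(tagged, "JJ")  # matches JJ, JJR, JJS
nouns = words_with_tag_prefix(tagged, "NN")       # matches NN, NNS, NNP, NNPS
print(adjectives)
print(nouns)
```

Matching on the tag prefix is a convenience: in the Penn Treebank tagset used by NLTK's default tagger, all adjective tags start with JJ and all noun tags start with NN.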

 

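PunktSentenceTokenizer can also be trained on a corpus from a specific domain, so that sentence boundaries are detected more reliably before the text is tagged. The POS tagger itself can likewise be trained on tagged text from that domain. This would help to obtain higher accuracy, if enough training data is available.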

 

That’s it!
Please let me know in the comments if I missed something! The aim is to learn together! 🙂

 


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.