What is POS Tagging in NLP?
Part of Speech (POS) tagging takes a sequence of words and labels each one with a part of speech based on its context in the text. It is also called word class disambiguation or grammatical tagging. It is useful alongside models such as bag of words, which do not take the surrounding context of a word into consideration. POS tags at least provide some information about the lexical category of each word, which can be a useful feature for some machine learning classifiers. For example, in sentiment analysis, they help to distinguish between the adjective "good" (as in "well"), which has a positive sentiment, and the noun "good" (as in "product"), which has a neutral sentiment. POS tagging is also useful for Named Entity Recognition, since named entities are typically proper nouns and POS tags help to find them.
Techniques for POS Tagging
Lexical based techniques
Using a tagged training corpus, this technique simply assigns each word the POS tag that it occurs with most frequently in the training data.
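A minimal sketch of this most-frequent-tag idea in plain Python is shown here. The tiny training corpus, the function names and the "NN" fallback for unseen words are all assumptions made for illustration, not part of any library.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical; real taggers need far more data).
TRAIN = [
    [("the", "DT"), ("code", "NN"), ("works", "VBZ")],
    [("they", "PRP"), ("code", "VBP"), ("daily", "RB")],
    [("the", "DT"), ("code", "NN"), ("is", "VBZ"), ("clean", "JJ")],
]

def train_most_frequent(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_most_frequent(tokens, model, default="NN"):
    """Tag each token on its own; unseen words fall back to `default`."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_most_frequent(TRAIN)
print(tag_most_frequent(["they", "code", "the", "parser"], model))
# [('they', 'PRP'), ('code', 'NN'), ('the', 'DT'), ('parser', 'NN')]
```

Note that "code" is tagged as a noun even where it is used as a verb, because this approach never looks at the neighbouring words.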
Rule based techniques
This is useful when no training data is available. It follows a set of predefined rules to tag words. For example, a word that is preceded by a determiner is tagged as a noun, a word that ends with "ed" as a verb, a word that ends with "ly" as an adverb, and so on. Such rules are heuristics and will misfire on some words. Brill’s tagger is an early POS tagger that uses a rule-based technique, although it learns its rules from tagged training data.
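The example rules above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the determiner list, the rule order, the "VBD" tag for "ed" words and the "NN" default are all hypothetical choices.

```python
import re

DETERMINERS = {"a", "an", "the"}

# Ordered suffix rules: the first match wins. Deliberately crude.
SUFFIX_RULES = [
    (re.compile(r".+ly$"), "RB"),   # e.g. "quickly" -> adverb
    (re.compile(r".+ed$"), "VBD"),  # e.g. "walked" -> past-tense verb
]

def rule_based_tag(tokens, default="NN"):
    """Tag tokens using hand-written rules only, with no training data."""
    tagged = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in DETERMINERS:
            tag = "DT"
        elif i > 0 and tokens[i - 1].lower() in DETERMINERS:
            tag = "NN"  # a word right after a determiner is assumed to be a noun
        else:
            tag = next((t for pat, t in SUFFIX_RULES if pat.match(low)), default)
        tagged.append((tok, tag))
    return tagged

print(rule_based_tag(["The", "dog", "walked", "quickly"]))
# [('The', 'DT'), ('dog', 'NN'), ('walked', 'VBD'), ('quickly', 'RB')]
```

The determiner rule shows how brittle such rules are: in "the red car" it would tag the adjective "red" as a noun.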
Probabilistic techniques
This tags words based on the probability of a particular sequence of tags occurring, given the words in the sentence. Hidden Markov models are a common choice for this technique.
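To make the hidden Markov model idea concrete, here is a small Viterbi decoding sketch in plain Python. All probabilities are hand-picked, hypothetical values (a real tagger would estimate them from a tagged corpus), and the `UNSEEN` constant is an assumed stand-in for proper smoothing.

```python
TAGS = ["DT", "NN", "VB"]

# Hypothetical, hand-picked probabilities for illustration only.
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {  # TRANS[prev][cur] = P(cur tag | prev tag)
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {  # EMIT[tag][word] = P(word | tag)
    "DT": {"the": 0.9},
    "NN": {"dog": 0.4, "barks": 0.1, "code": 0.3},
    "VB": {"barks": 0.5, "code": 0.3},
}
UNSEEN = 1e-6  # tiny probability for word/tag pairs not listed above

def viterbi(tokens):
    """Return the most probable tag sequence for `tokens` under the toy HMM."""
    if not tokens:
        return []
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (START[t] * EMIT[t].get(tokens[0], UNSEEN), None) for t in TAGS}]
    for tok in tokens[1:]:
        col = {}
        for t in TAGS:
            col[t] = max(
                (best[-1][p][0] * TRANS[p][t] * EMIT[t].get(tok, UNSEEN), p)
                for p in TAGS
            )
        best.append(col)
    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    path.reverse()
    return list(zip(tokens, path))

print(viterbi(["the", "dog", "barks"]))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VB')]
```

Unlike a per-word lookup, the decoder uses context: "barks" could be a noun or a verb here, and the transition from a noun makes the verb reading more probable.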
Neural network and deep learning techniques
This technique is the most advanced, and uses neural models such as recurrent neural networks (RNNs) to tag words.
POS Tagging with NLTK
This can be easily done using NLTK, as shown in the code below.
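import nltk
from nltk.tokenize import word_tokenize

# The tokenizer and tagger models must be downloaded once with nltk.download(...).

# Split the raw text into word tokens.
text = word_tokenize("""Whether you're just new a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!""")

# Tag each token with its part of speech.
print(nltk.pos_tag(text))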
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('just', 'RB'), ('new', 'JJ'), ('a', 'DT'), ('new', 'JJ'), ('programmer', 'NN'), ('who', 'WP'), ('is', 'VBZ'), ('just', 'RB'), ('learning', 'VBG'), ('programming', 'VBG'), (',', ','), ('or', 'CC'), ('even', 'RB'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('very', 'RB'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('code', 'VB'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('an', 'DT'), ('awesome', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('text', 'NN'), ('mining', 'NN'), ('too', 'RB'), ('!', '.')]
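PunktSentenceTokenizer can also be trained on a corpus from a specific domain, so that the text is split into sentences more reliably before tagging. NLTK's trainable taggers can likewise be trained on a tagged corpus from that domain. This would help to obtain higher accuracy, if enough training data is available.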
Please let me know in the comments if I missed something! The aim is to learn together! 🙂