How to Perform Tokenization for NLP with Python (2020)

 

What is tokenization in NLP?

Natural language processing is a multi-step process. Before we feed our text data into any NLP model for tasks such as text classification, sentiment analysis, translation, or chatbots, we must clean it. Preprocessing (or cleansing) text data is essential before the text can be analyzed by our model. In fact, preprocessing is crucial, since textual data is generally unstructured and sometimes messy. One of the steps of preprocessing is tokenization. Tokenization is simply the process of splitting up a piece of text into smaller text units called ‘tokens’. This is a very common initial step in the pipeline. An entire text document can be split up into paragraph tokens, each paragraph into sentence tokens, and each sentence into word tokens. Words can even be split up into character tokens in some cases. For now, we will just deal with splitting up text into word tokens, the most common task when working with sentiment analysis, text classification and other popular text mining tasks.
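To make these levels concrete, here is a minimal sketch using nothing but built-in string methods, with a deliberately naive ‘. ’ sentence delimiter that the real tokenizers below improve on:

# Naive illustration of tokenization levels, using built-ins only
document = "Python is popular. It is also easy to learn."
# Sentence tokens (naively assumes '. ' ends a sentence)
sentences = document.split(". ")
# Word tokens for each sentence (naively splits on whitespace)
words = [sentence.split() for sentence in sentences]
print(sentences)  # ['Python is popular', 'It is also easy to learn.']
print(words)      # [['Python', 'is', 'popular'], ['It', 'is', 'also', 'easy', 'to', 'learn.']]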

 

Why is tokenization important?

Think of it this way: before we can analyze any natural language content, we need to recognize the individual words of the text. At the level of raw data, a word is just a string of characters; but in order to understand the meaning underlying the text, we need to interpret the words available within it, and nearly every later step in an NLP pipeline operates on word tokens rather than on the raw string.

 

How to perform word tokenization?

There are multiple easy and quick ways to perform text tokenization in Python. We can use Python’s built-in features without any additional modules, or we can use one of the popular NLP libraries such as NLTK, TextBlob or Gensim, among others. This is largely a matter of personal preference; choose whichever module you prefer to work with.

 

Word tokenization using regular expressions in Python

Regular expressions are very useful for matching a certain pattern of strings in a text. This has many uses in text mining, since it allows us to detect and manipulate specific patterns. Say we need to identify only the email addresses within a text, or maybe only postal addresses, phone numbers, or particular named entities (there is a short email-matching sketch after the output below). This can be done easily with the ‘re’ module, which comes with Python by default. In this case we focus on tokenization using re, but it is a powerful tool that can do much more. The code below uses re to tokenize the target text.

import re
text = "Whether you're a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"
# Match runs of word characters and apostrophes, skipping other punctuation
tokens = re.findall(r"[\w']+", text)
print(tokens)

Output:

['Whether', "you're", 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming', 'or', 'even', 'an', 'experienced', 'developer', "it's", 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', "It's", 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too']
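Since we mentioned extracting email addresses above, here is a quick sketch of the same re.findall() approach with a different pattern; the regex is deliberately simplified for illustration, not a production-grade email matcher:

import re
contact = "Contact us at support@example.com or sales@example.org."
# Simplified pattern: local part, '@', domain name, dot, TLD
emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", contact)
print(emails)  # ['support@example.com', 'sales@example.org']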

 

Word tokenization using the split() method in Python

Using Python’s basic split() method is not the most accurate approach, since it splits only on whitespace and leaves punctuation attached to words, but it gives us an idea of what to expect from the features Python offers out of the box.

text = "Whether you're just new a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"
print(text.split())

Output:

['Whether', "you're", 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming,', 'or', 'even', 'an', 'experienced', 'developer,', "it's", 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python.', "It's", 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too!']
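Notice how punctuation stays glued to the words: 'programming,', 'developer,', 'Python.' and 'too!'. If you want to stay within the standard library but avoid this, one common workaround is to strip punctuation with str.translate() before splitting. A sketch (with the caveat that apostrophes are removed too, so it's becomes its):

import string
text = "It's very easy to learn and code in Python. It's an awesome tool!"
# Delete every ASCII punctuation character, then split on whitespace
cleaned = text.translate(str.maketrans("", "", string.punctuation))
print(cleaned.split())
# ['Its', 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', 'Its', 'an', 'awesome', 'tool']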

 

Word tokenization using Python NLTK’s word_tokenize()

This function comes with NLTK and works as a wrapper around the TreebankWordTokenizer, which uses rules based on whitespace and punctuation to split sentences into words. It does not remove the punctuation, but treats punctuation symbols as tokens in their own right. You can then decide whether to discard the punctuation or include it in your analysis (see the short filter sketch after the output below).

from nltk.tokenize import word_tokenize
# Note: word_tokenize needs the 'punkt' models; run nltk.download('punkt') once if needed
text = """Whether you're a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"""
print(word_tokenize(text))

Output:

['Whether', 'you', "'re", 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming', ',', 'or', 'even', 'an', 'experienced', 'developer', ',', 'it', "'s", 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', '.', 'It', "'s", 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too', '!']

As you can see, it splits contractions at the apostrophe: it's becomes the two tokens it and 's. This follows the Penn Treebank convention, which may or may not be what you want for your analysis.
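If you decide to discard the punctuation tokens, a simple follow-up filter works; this sketch uses Python's string.punctuation, which covers ASCII punctuation only:

import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("It's very easy to learn Python, isn't it?")
# Keep only the tokens that are not pure punctuation marks
words_only = [t for t in tokens if t not in string.punctuation]
print(words_only)
# ['It', "'s", 'very', 'easy', 'to', 'learn', 'Python', 'is', "n't", 'it']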

 

Word tokenization using Python TextBlob 

It is very easy to use TextBlob: first convert the string of text into a TextBlob object, then read its words property.

from textblob import TextBlob
text = "Whether you're a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"
# Wrap the string in a TextBlob object, then read its words property
blob = TextBlob(text)
print(blob.words)

Output:

['Whether', 'you', "'re", 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming', 'or', 'even', 'an', 'experienced', 'developer', 'it', "'s", 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', 'It', "'s", 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too']
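Notice that TextBlob drops the punctuation tokens for you. The same object also supports sentence tokenization through its sentences property; a minimal sketch:

from textblob import TextBlob
blob = TextBlob("It's very easy to learn Python. It's an awesome tool for text mining too!")
# .sentences returns a list of Sentence objects
print(blob.sentences)
# [Sentence("It's very easy to learn Python."), Sentence("It's an awesome tool for text mining too!")]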

 

Word tokenization using Python Gensim

Gensim, a library best known for topic modelling, is also a good choice: its utils module ships a simple tokenize() function. Note from the output below that it splits on non-alphabetic characters, so apostrophes are dropped entirely and you're becomes you and re.

from gensim.utils import tokenize
text = "Whether you're a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"
# tokenize() returns a generator, so wrap it in list() to print the tokens
print(list(tokenize(text)))

Output:

['Whether', 'you', 're', 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming', 'or', 'even', 'an', 'experienced', 'developer', 'it', 's', 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', 'It', 's', 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too']
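tokenize() also accepts a few keyword arguments, at least in recent Gensim versions, such as lowercase to fold case and deacc to strip accent marks; a quick sketch:

from gensim.utils import tokenize
# lowercase=True folds case; deacc=True removes accents ('Café' -> 'cafe')
print(list(tokenize("Café culture is great for Python coders!", lowercase=True, deacc=True)))
# ['cafe', 'culture', 'is', 'great', 'for', 'python', 'coders']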

 

Word tokenization using Python spaCy

spaCy requires a few more lines of code, but this tool goes far beyond basic NLP. For now, we just need it for tokenization; a follow-up sketch after the output shows how to filter tokens by their attributes.

from spacy.lang.en import English
# Create a blank English pipeline, which includes spaCy's tokenizer
nlp_tools = English()
text = "Whether you're a new programmer who is just learning \
programming, or even an experienced developer, it's very easy to learn \
and code in Python. It's an awesome tool for text mining too!"
# Parse the text into a spaCy Doc object
doc = nlp_tools(text)
# Collect the text of each token in the Doc
tokens = [token.text for token in doc]
print(tokens)

Output:

['Whether', 'you', "'re", 'a', 'new', 'programmer', 'who', 'is', 'just', 'learning', 'programming', ',', 'or', 'even', 'an', 'experienced', 'developer', ',', 'it', "'s", 'very', 'easy', 'to', 'learn', 'and', 'code', 'in', 'Python', '.', 'It', "'s", 'an', 'awesome', 'tool', 'for', 'text', 'mining', 'too', '!']
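Because spaCy tokens are rich objects rather than plain strings, you can filter them by attributes such as is_punct (or is_stop) without any extra bookkeeping. A sketch, reusing the same blank English pipeline as above:

from spacy.lang.en import English
nlp_tools = English()
doc = nlp_tools("It's very easy to learn Python. It's awesome!")
# Keep the text of every token that is not a punctuation mark
words_only = [token.text for token in doc if not token.is_punct]
print(words_only)
# ['It', "'s", 'very', 'easy', 'to', 'learn', 'Python', 'It', "'s", 'awesome']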

 

That's about it! 

Here you have a bunch of different ways to dissect text into word tokens for later analysis. Please let me know in the comments if I missed anything. The aim is to learn together! 🙂

 


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.
LinkedIn
