How to Generate a Wordcloud using Matplotlib

A wordcloud is a graphical display of the most frequently occurring words in a body of text. The size of each word is based on its frequency, so words that occur very often appear larger than others. It is simply a visual summary of the most prominent terms.
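To make the link between frequency and size concrete, here is a minimal sketch that uses only the Python standard library. The helper names `word_frequencies` and `font_sizes` are hypothetical, and the proportional scaling is a simplifying assumption for illustration, not the exact sizing used by any particular wordcloud library.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_sizes(counts, min_size=10, max_size=60):
    """Give the most frequent word max_size and scale the rest in proportion.

    Illustrative assumption only: real libraries use their own sizing rules.
    """
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: max(min_size, max_size * count / top)
        for word, count in counts.items()
    }


sample = "language data language text language data"
counts = word_frequencies(sample)
sizes = font_sizes(counts)

# Print each word with its count and illustrative font size, most frequent first
for word, count in counts.most_common():
    print(word, count, sizes[word])
```

In this sketch the word that occurs three times gets the largest size, and the word that occurs once gets a third of it, which is the idea behind the visual emphasis in a wordcloud.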

 

The input is a short article about NLP, saved as nlp.txt and shown below.

What is Natural Language Processing (NLP)?
NLP is the use of computer science, AI, and computational linguistics to allow computers to interact with and understand natural language. By natural language, I mean any human-spoken language that exists in the world, such as English. In its simplest form, it enables computers and artificially intelligent machines to understand and manipulate (usually large amounts of) natural language data. It is a discipline that concentrates on the direct interaction between natural language and data sciences, and has been applied in every domain available.
To a computer, natural language is just a bunch of words (strings), each of which is composed of a bunch of characters. Computers can’t understand this, since it is meant for human consumption. Computers need to process this text in order to better understand it. There are generally three main levels of processing, namely, lexical processing, syntactic processing and semantic processing.
Lexical processing involves the investigation of words in a text, to better understand that text. This is called a bag of words approach, where words are taken out of context, and the surrounding context of each word is lost.
Syntactic processing involves understanding the syntactic or grammatical structure in order to help understand a sentence. This allows for more details than lexical processing, since syntactic information can help to add more meaning to words. For example, a determiner in front of a certain word means that it is most likely a noun. Adding syntactic information reduces ambiguity.
Semantic processing is the final level, and involves using semantic knowledge bases to infer the correct meaning of the word based on the meanings of other words surrounding it. To allow computers to obtain as much information as possible about natural language, semantic processing is crucial, but is the most complex level of text processing.
NLP overlaps with other areas such as computational linguistics, textual data mining, and text analytics. It largely overlaps with computational linguistics, which is using computers to study all aspects of human language, including semantics, syntax/grammar and phonetics. Ever since the Web has become read-write, NLP has been an area for both academicians and industries. It opens new challenges in research, and has proven to be very valuable in decision-making by for-profit organizations.

 

In the code below, we use the wordcloud library to build the wordcloud from this text and to remove stopwords, which are very common words (such as "the" and "is") that are not relevant to us. We then use matplotlib to display the resulting wordcloud image.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# read the input article (assumes nlp.txt is UTF-8 and in the working directory)
with open("nlp.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

# check contents
print("File data sample:", corpus[:500])

# copy the library's built-in set of irrelevant stopwords
stopwords = set(STOPWORDS)

# generate the wordcloud, keeping at most 40 words
wordcloud = WordCloud(stopwords=stopwords, max_words=40,
                      background_color="white").generate(corpus)

# display the wordcloud image without axes
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
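The effect of the stopwords and max_words settings can be approximated without any plotting. The sketch below is a rough standard-library illustration, not the wordcloud library's own implementation: `DEMO_STOPWORDS` is a hypothetical, deliberately tiny stopword list (the library's STOPWORDS set is much larger), and `top_words` is a helper name invented for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list used only for this illustration
DEMO_STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def top_words(text, stopwords, max_words):
    """Return up to max_words (word, count) pairs, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in stopwords]
    return Counter(kept).most_common(max_words)


text = "The meaning of a word is tied to the text and the text is data"
result = top_words(text, DEMO_STOPWORDS, max_words=2)
print(result)
```

Without the filter, "the" would be the most frequent word in this sentence. Once stopwords are dropped, "text" comes first, and max_words caps how many words are kept.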

 

The output is an image of the wordcloud, based on the input body of text. As can be seen, the prominent words in this text are Natural Language Processing, NLP, Semantic, Syntactic, Information, Data, Text, and other words related to the topic of NLP. You can try this out using your own text.

[Image: NLP wordcloud]

 


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.