How to remove stop words for NLP

Why is stop words removal important?

In NLP and text mining tasks, an important first step is preprocessing, or text cleansing. One of the most common things to do during preprocessing is to remove stop words from the dataset to be analyzed. Why do we remove stop words? Why not just leave them in? The usual answer is that stop words occur very frequently, yet they carry little information for many kinds of text analysis. Removing them can drastically reduce the size of the dataset and lets the model focus on the more relevant portion of it, which improves overall efficiency. Since text corpora are usually very large, stop words removal is a cheap way to shrink the dataset, often with little effect on the outcome. It is not harmless for every task, though: words such as “not” appear on most stop word lists, and dropping them can change the meaning of a sentence, which matters in tasks like sentiment analysis. Search engines have traditionally ignored stop words too, which suggests that for many tasks they do not have a great impact on the outcome.

Consider the text “I am running late to school. It would be good to take a taxi.” It contains the stop words “I, am, to, it, be, to, a”. As you can see, these words make up a good share of the text: 7 of its 14 words. Removing them cuts the size of this tiny corpus in half.
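To make the arithmetic concrete, here is a minimal sketch that counts the stop words in that sentence using only the Python standard library. The `stop_words` set is a hand-written mini list containing just the words named above, not a real stop word list, and the regular expression is a simple stand-in for a proper tokenizer.

```python
import re

text = "I am running late to school. It would be good to take a taxi."

# Hypothetical mini stop word list: only the words named in the example above.
stop_words = {"i", "am", "to", "it", "be", "a"}

# Lowercase the text and pull out the words, ignoring punctuation.
words = re.findall(r"[a-z]+", text.lower())

removed = [w for w in words if w in stop_words]
kept = [w for w in words if w not in stop_words]

print(len(words), "words in total")
print(len(removed), "stop words:", removed)
print(len(kept), "words kept:", kept)
```

Half of the words are stop words here, which matches the estimate above.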

 

Where can I find a list of stop words?

There are readily available stop word lists you can use. Alternatively, a good way to make your own is to check the most frequently used terms in the corpus and remove the top X%. But it is crucial to remove only those that are irrelevant, such as determiners, prepositions, pronouns, etc. The NLTK module for Python contains a reliable list of stop words for many different languages. You can add words to this list or remove words from it if required. The following code accesses the English stop words in NLTK and prints the list (179 stop words in the version used here).

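import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # one-time download of the stop word lists
english_stopwords = stopwords.words('english')
print(english_stopwords)
print(len(english_stopwords))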

 

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
179
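If you would rather build your own list from the most frequent terms, as suggested earlier, the idea can be sketched with the standard library alone. Everything here is made up for illustration: the toy `corpus`, the cut-off `top_n`, and the `keep_anyway` set that stands in for a manual review of the candidates.

```python
from collections import Counter
import re

# Hypothetical toy corpus; in practice this would be your own documents.
corpus = [
    "The taxi is late and the school is far.",
    "The data is in the file and the file is large.",
    "Stop words are in the text and the text is long.",
]

# Count how often each lowercased word occurs across the whole corpus.
counts = Counter(
    word for doc in corpus for word in re.findall(r"[a-z]+", doc.lower())
)

# Take the most frequent terms as stop word candidates (top_n is an arbitrary choice).
top_n = 5
candidates = [word for word, _ in counts.most_common(top_n)]

# Review the candidates by hand: frequent but meaningful words should be kept.
keep_anyway = {"file", "text"}
custom_stop_words = {w for w in candidates if w not in keep_anyway}

print(candidates)
print(custom_stop_words)
```

The manual review step is the important part: frequency alone cannot tell a determiner from a word that is simply central to the corpus.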

 

The following is the code to print the stop words list in Gensim.

import gensim
stopwords = gensim.parsing.preprocessing.STOPWORDS
print(stopwords)
print(len(stopwords))

Output:

frozenset({'her', 'during', 'among', 'thereafter', 'only', 'hers', 'in', 'none', 'with', 'un', 'put', 'hence', 'each', 'would', 'have', 'to', 'itself', 'that', 'seeming', 'hereupon', 'someone', 'eight', 'she', 'forty', 'much', 'throughout', 'less', 'was', 'interest', 'elsewhere', 'already', 'whatever', 'or', 'seem', 'fire', 'however', 'keep', 'detail', 'both', 'yourselves', 'indeed', 'enough', 'too', 'us', 'wherein', 'himself', 'behind', 'everything', 'part', 'made', 'thereupon', 'for', 'nor', 'before', 'front', 'sincere', 'really', 'than', 'alone', 'doing', 'amongst', 'across', 'him', 'another', 'some', 'whoever', 'four', 'other', 'latterly', 'off', 'sometime', 'above', 'often', 'herein', 'am', 'whereby', 'although', 'who', 'should', 'amount', 'anyway', 'else', 'upon', 'this', 'when', 'we', 'few', 'anywhere', 'will', 'though', 'being', 'fill', 'used', 'full', 'thru', 'call', 'whereafter', 'various', 'has', 'same', 'former', 'whereas', 'what', 'had', 'mostly', 'onto', 'go', 'could', 'yourself', 'meanwhile', 'beyond', 'beside', 'ours', 'side', 'our', 'five', 'nobody', 'herself', 'is', 'ever', 'they', 'here', 'eleven', 'fifty', 'therefore', 'nothing', 'not', 'mill', 'without', 'whence', 'get', 'whither', 'then', 'no', 'own', 'many', 'anything', 'etc', 'make', 'from', 'against', 'ltd', 'next', 'afterwards', 'unless', 'while', 'thin', 'beforehand', 'by', 'amoungst', 'you', 'third', 'as', 'those', 'done', 'becoming', 'say', 'either', 'doesn', 'twenty', 'his', 'yet', 'latter', 'somehow', 'are', 'these', 'mine', 'under', 'take', 'whose', 'others', 'over', 'perhaps', 'thence', 'does', 'where', 'two', 'always', 'your', 'wherever', 'became', 'which', 'about', 'but', 'towards', 'still', 'rather', 'quite', 'whether', 'somewhere', 'might', 'do', 'bottom', 'until', 'km', 'yours', 'serious', 'find', 'please', 'hasnt', 'otherwise', 'six', 'toward', 'sometimes', 'of', 'fifteen', 'eg', 'just', 'a', 'me', 'describe', 'why', 'an', 'and', 'may', 'within', 'kg', 'con', 're', 
'nevertheless', 'through', 'very', 'anyhow', 'down', 'nowhere', 'now', 'it', 'cant', 'de', 'move', 'hereby', 'how', 'found', 'whom', 'were', 'together', 'again', 'moreover', 'first', 'never', 'below', 'between', 'computer', 'ten', 'into', 'see', 'everywhere', 'there', 'neither', 'every', 'couldnt', 'up', 'several', 'the', 'i', 'becomes', 'don', 'ie', 'been', 'whereupon', 'seemed', 'most', 'noone', 'whole', 'must', 'cannot', 'per', 'my', 'thereby', 'so', 'he', 'name', 'co', 'its', 'everyone', 'if', 'become', 'thick', 'thus', 'regarding', 'didn', 'give', 'all', 'show', 'any', 'using', 'on', 'further', 'around', 'back', 'least', 'since', 'anyone', 'once', 'can', 'bill', 'hereafter', 'be', 'seems', 'their', 'myself', 'nine', 'also', 'system', 'at', 'more', 'out', 'twelve', 'therein', 'almost', 'except', 'last', 'did', 'something', 'besides', 'via', 'whenever', 'formerly', 'cry', 'one', 'hundred', 'sixty', 'after', 'well', 'them', 'namely', 'empty', 'three', 'even', 'along', 'because', 'ourselves', 'such', 'top', 'due', 'inc', 'themselves'})
337
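Both lists can be customized with ordinary set operations. NLTK returns a plain list, which can be wrapped in `set()`, and Gensim's `STOPWORDS` is a frozenset, so it cannot be changed in place, but union and difference return a new set. The sketch below uses a small hand-written excerpt in place of a full list, and the words added and kept are hypothetical choices (keeping negations is a common one for sentiment analysis).

```python
# Small excerpt standing in for a full list such as set(stopwords.words('english')).
base_stop_words = {"i", "am", "the", "to", "no", "nor", "not", "very"}

# Hypothetical domain-specific words to add, and negations to keep in the text.
extra_words = {"etc", "via"}
words_to_keep = {"no", "nor", "not"}

# Set union adds words; set difference removes them.
custom_stop_words = (base_stop_words | extra_words) - words_to_keep

tokens = ["the", "taxi", "not", "very", "late", "etc"]
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```

Here “not” survives the filtering because it was taken off the list, while “etc” is dropped because it was added.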

 

Removing stop words

We remove stop words using NLTK in the example below.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's tokenizer data in addition to the 'stopwords' corpus.
text = """I am interested in learning the many aspects of data science. \
I also want to learn natural language processing tasks which can be \
applied practically and help to solve real problems."""
# A set makes each membership test fast.
stopwords_list = set(stopwords.words('english'))
text_tokens = word_tokenize(text)

# Keep every token that is not in the stop word list (the comparison is case-sensitive).
cleansed_text = []
for word in text_tokens:
    if word not in stopwords_list:
        cleansed_text.append(word)

print(text_tokens)
print("size of original text is", len(text_tokens))
print("\n")
print(cleansed_text)
print("size of cleansed text after stopwords removal is", len(cleansed_text))

 

Output:

['I', 'am', 'interested', 'in', 'learning', 'the', 'many', 'aspects', 'of', 'data', 'science', '.', 'I', 'also', 'want', 'to', 'learn', 'natural', 'language', 'processing', 'tasks', 'which', 'can', 'be', 'applied', 'practically', 'and', 'help', 'to', 'solve', 'real', 'problems', '.']
size of original text is 33

['I', 'interested', 'learning', 'many', 'aspects', 'data', 'science', '.', 'I', 'also', 'want', 'learn', 'natural', 'language', 'processing', 'tasks', 'applied', 'practically', 'help', 'solve', 'real', 'problems', '.']
size of cleansed text after stopwords removal is 23

 

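As can be seen in the output, the text was reduced from 33 to 23 tokens after stop words removal. Note that the two periods are counted as tokens, and that “I” was not removed: the NLTK list is all lowercase, and the comparison in the code is case-sensitive.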
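The capitalized “I” survives in the output above because the NLTK list contains only lowercase entries. A minimal sketch of a case-insensitive variant follows. To keep it runnable without NLTK, the tokens are copied from the output above and `stop_words` is a hand-picked excerpt of the NLTK list (only the entries that occur in this text), so treat both as assumptions.

```python
# Tokens copied from the output above.
text_tokens = ['I', 'am', 'interested', 'in', 'learning', 'the', 'many', 'aspects',
               'of', 'data', 'science', '.', 'I', 'also', 'want', 'to', 'learn',
               'natural', 'language', 'processing', 'tasks', 'which', 'can', 'be',
               'applied', 'practically', 'and', 'help', 'to', 'solve', 'real',
               'problems', '.']

# Excerpt of the NLTK English list: only the entries that matter for this text.
stop_words = {"i", "am", "in", "the", "of", "to", "which", "can", "be", "and"}

# Lowercase each token for the comparison only; kept tokens retain their original case.
cleansed_text = [tok for tok in text_tokens if tok.lower() not in stop_words]

# Optionally drop punctuation-only tokens as well.
words_only = [tok for tok in cleansed_text if tok.isalpha()]

print(cleansed_text)
print(len(cleansed_text), len(words_only))
```

With the full NLTK list, the same change would be `word.lower() not in stopwords_list` in the loop.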

 

The following code reads text from a file, removes stop words, and writes the cleansed text to a new file.

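from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# this corpus contains the same text as the above example
with open("corpus.txt") as f:
    text = f.read()
# Split on whitespace and keep only the tokens that are not stop words.
tokens = [token for token in text.split() if token not in stop_words]
# Open the output file once; 'w' overwrites an earlier result instead of appending to it.
with open('cleansed_text.txt', 'w') as f2:
    f2.write(" ".join(tokens))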

 

The output is written to a text file named ‘cleansed_text.txt’ in the current working directory.
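For a large corpus it may be preferable not to read the whole file into memory. The sketch below processes the input one line at a time. The function name `clean_file` is made up for this example, `stop_words` is again a small excerpt standing in for `set(stopwords.words('english'))`, and the demo runs in a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def clean_file(src, dst, stop_words):
    """Copy src to dst line by line, dropping tokens found in stop_words."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            kept = [tok for tok in line.split() if tok.lower() not in stop_words]
            fout.write(" ".join(kept) + "\n")


# Excerpt standing in for set(stopwords.words('english')).
stop_words = {"i", "am", "in", "the", "of", "to", "which", "can", "be", "and"}

# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus.txt"
    dst = Path(tmp) / "cleansed_text.txt"
    src.write_text(
        "I am interested in learning the many aspects of data science.\n",
        encoding="utf-8",
    )
    clean_file(src, dst, stop_words)
    result = dst.read_text(encoding="utf-8")

print(result)
```

Because the text is split on whitespace only, punctuation stays attached to the neighbouring word, as in “science.”.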

 

The following is the code to remove stop words from a text in Gensim. Its remove_stopwords function takes the text as a string and returns a string.

from gensim.parsing.preprocessing import remove_stopwords
text = """I am interested in learning the many aspects of data science. \
I also want to learn natural language processing tasks which can be \
applied practically and help to solve real problems."""
cleansed_text = remove_stopwords(text)
print(text.split())
print(len(text.split()), "\n")
print(cleansed_text) 
print(len(cleansed_text.split()))

 

Output:

['I', 'am', 'interested', 'in', 'learning', 'the', 'many', 'aspects', 'of', 'data', 'science.', 'I', 'also', 'want', 'to', 'learn', 'natural', 'language', 'processing', 'tasks', 'which', 'can', 'be', 'applied', 'practically', 'and', 'help', 'to', 'solve', 'real', 'problems.']
31 
I interested learning aspects data science. I want learn natural language processing tasks applied practically help solve real problems.
19
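Gensim leaves 19 tokens where NLTK left 23. Two of the extra NLTK tokens are the periods, which word_tokenize splits off but text.split() keeps attached to “science.” and “problems.”. The other two are “many” and “also”, which appear in the Gensim list printed earlier but not in the NLTK one. A minimal sketch of that comparison, using hand-copied excerpts rather than the full lists:

```python
# Hand-copied excerpts of the two lists printed earlier, not the full lists.
nltk_excerpt = {"i", "am", "in", "the", "of", "to", "which", "can", "be", "and"}
gensim_excerpt = nltk_excerpt | {"many", "also"}

text = ("I am interested in learning the many aspects of data science. "
        "I also want to learn natural language processing tasks which can be "
        "applied practically and help to solve real problems.")

# Words that only the Gensim excerpt treats as stop words.
only_in_gensim = gensim_excerpt - nltk_excerpt

# Which words of the example text are affected by that difference?
affected = [w for w in text.split() if w in only_in_gensim]
print(affected)
```

With both libraries installed, the same set difference can be taken between the full lists, for example `set(STOPWORDS) - set(stopwords.words('english'))`.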

 

That’s about it!

What are your thoughts? Please mention anything useful I might have missed in the comments. The aim here is to learn together 🙂

 

Photo by katerha


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.