Basic Supervised Sentiment Classification using NLTK Naive Bayes

Problem definition

Here we perform sentiment classification on a set of movie reviews using the default Naive Bayes classifier in NLTK. The dataset is the Movie Reviews dataset, which we download through NLTK. We use 90% of the documents for training and the remaining 10% for testing, and we keep only the 3000 most frequent words in the dataset as features. The documents are shuffled before the split, so you may get slightly different results on each run.
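As a rough illustration of this setup, the sketch below shuffles a list of labelled documents, builds a vocabulary from the most frequent words, and makes a 90/10 split. The toy documents and the helper names `top_words` and `split_train_test` are assumptions made for this example and are not taken from the original code.

```python
import random
from collections import Counter

def top_words(documents, n):
    """Return the n most frequent lowercase words across all documents."""
    counts = Counter(word.lower() for words, _label in documents for word in words)
    return [word for word, _count in counts.most_common(n)]

def split_train_test(documents, train_fraction=0.9, seed=None):
    """Shuffle a copy of the documents and split it into train and test parts."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for the labelled reviews (hypothetical data).
docs = [(["great", "film"], "pos"), (["dull", "film"], "neg")] * 10
vocabulary = top_words(docs, 3000)  # fewer than 3000 distinct words here, so all are kept
train_docs, test_docs = split_train_test(docs, seed=0)
print(len(train_docs), len(test_docs))
```

Passing a fixed `seed` makes the split repeatable; leaving it as `None` gives a different shuffle, and so slightly different results, on each run.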

 

NLTK Naive Bayes classifier

The Naive Bayes classifier is a supervised machine learning classifier that makes an independence assumption about the features in the feature vector used for training. In simple terms, it is ‘naive’ in that it assumes that the appearance of a certain feature in a class is unrelated to the appearance of any other feature in that class. Despite this naive assumption, it is a powerful algorithm and is often used in supervised classification.
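Written out, the independence assumption means the score of a class is a product of per-feature terms. The symbols here (class c, features f_1 to f_n, predicted class c-hat) are introduced only for illustration and do not come from the original post.

```latex
P(c \mid f_1, \dots, f_n) \;\propto\; P(c) \prod_{i=1}^{n} P(f_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The classifier picks the class with the highest score, so each feature contributes its own factor regardless of which other features are present.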

NLTK’s default Naive Bayes classifier stores features as dictionary entries, each of which maps a feature name to some value. In this case, the word itself is the feature, and the value is a boolean that is true if the word occurs in the document and false otherwise. This gives us a Bag of Words model, in which the order and context of the words are ignored.
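A minimal sketch of such a feature dictionary is shown below. The `contains(word)` key format mirrors the feature names in the output listing; the function name `document_features` and the sample review are assumptions for illustration.

```python
def document_features(document_words, word_features):
    """Map each vocabulary word to True/False depending on its presence."""
    present = set(document_words)  # set membership ignores order and repeats
    return {f"contains({word})": (word in present) for word in word_features}

vocabulary = ["sucks", "regard", "annual"]  # illustrative subset of the vocabulary
review = ["this", "movie", "sucks", "it", "really", "sucks"]
features = document_features(review, vocabulary)
print(features)
```

Note that "sucks" occurs twice in the sample review but still yields a single `True`, which is exactly the information a boolean Bag of Words model discards.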

 

The full code
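What follows is a minimal sketch of a pipeline matching the description above, built on the usual NLTK calls (`movie_reviews`, `FreqDist`, `NaiveBayesClassifier.train`, `nltk.classify.accuracy`). It is not necessarily the exact code behind the reported numbers: the function names `build_feature_sets` and `run_experiment`, their parameters, and the print messages are assumptions made for illustration.

```python
import random
import time

def build_feature_sets(documents, word_features):
    """Turn (words, label) pairs into (feature dict, label) pairs."""
    feature_sets = []
    for words, label in documents:
        present = set(words)
        features = {f"contains({w})": (w in present) for w in word_features}
        feature_sets.append((features, label))
    return feature_sets

def run_experiment(vocab_size=3000, train_fraction=0.9):
    """Train and evaluate NLTK's Naive Bayes classifier on movie_reviews."""
    # Imported here so the helper above can be used without NLTK installed.
    import nltk
    from nltk.corpus import movie_reviews

    nltk.download("movie_reviews", quiet=True)
    start = time.time()

    # One (list of words, label) pair per review, shuffled before splitting.
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)
    print("Number of docs in dataset is", len(documents))
    print("Classes available:", movie_reviews.categories())

    # Vocabulary: the most frequent words in the whole corpus.
    freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _count in freq.most_common(vocab_size)]

    feature_sets = build_feature_sets(documents, word_features)
    cut = int(len(feature_sets) * train_fraction)
    train_set, test_set = feature_sets[:cut], feature_sets[cut:]
    print("Docs in training dataset: ", len(train_set))
    print("Docs in test dataset: ", len(test_set))

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("Final accuracy is: ", nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(10)
    print("Execution time:", (time.time() - start) / 60, "Min")
    return classifier
```

Calling `run_experiment()` requires NLTK to be installed and downloads the corpus on first use. Because the documents are shuffled without a fixed seed, the accuracy and the list of informative features will vary from run to run.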

 

The output is shown below. The final accuracy is 0.835, which is not bad for a simple classifier with default settings. The 10 most informative features are also displayed. For example, the presence of “sucks” is 14.2 times more likely in a negative review than in a positive one, while the presence of “regard” is 10.4 times more likely in a positive review than in a negative one.

Number of docs in dataset is 2000 
Classes available: ['neg', 'pos']
Docs in training dataset:  1800
Docs in test dataset:  200
Final accuracy is:  0.835
Most Informative Features
         contains(sucks) = True              neg : pos    =     14.2 : 1.0
        contains(regard) = True              pos : neg    =     10.4 : 1.0
        contains(annual) = True              pos : neg    =      8.4 : 1.0
       contains(miscast) = True              neg : pos    =      7.7 : 1.0
         contains(groan) = True              neg : pos    =      7.6 : 1.0
   contains(silverstone) = True              neg : pos    =      7.6 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.0 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0
          contains(mena) = True              neg : pos    =      7.0 : 1.0
        contains(suvari) = True              neg : pos    =      7.0 : 1.0
None
~~~~~-Execution time: 0.3960246165593465 Min~~~~~
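The ratios in the listing above compare how likely a feature value is under one label versus the other. The sketch below shows the idea with additive smoothing of 0.5, which is how NLTK's default estimator is understood to work; the helper names, the `bins=2` choice, and the counts are hypothetical and were not taken from this run.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature value is under label a than label b."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 70 of 900 'neg' and 5 of 900 'pos' training reviews.
ratio = likelihood_ratio(70, 900, 5, 900)
print(f"neg : pos = {ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite even when a word never appears under one of the labels, which would otherwise cause a division by zero.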

 

Conclusion

Here we used NLTK’s default Naive Bayes classifier for sentiment classification of movie reviews. The model can certainly be enhanced to perform better, but this baseline is a good way to learn how sentiment classification works in the simplest way possible. Next time, we can use the Scikit-learn version of the Naive Bayes classifier, which supports numerical feature values rather than the simple boolean values used here with the NLTK classifier.

 


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.