Here we perform sentiment classification on movie reviews using the default Naive Bayes classifier in NLTK. The dataset is the Movie Reviews corpus, which we download through NLTK. We use 90% of the documents for training and the remaining 10% for testing, and we restrict the features to the 3000 most frequent words in the dataset. The documents are shuffled before the split, so you may get slightly different results on each run.
NLTK Naive Bayes classifier
The Naive Bayes classifier is a supervised machine learning classifier that assumes the features in the feature vector are independent of one another given the class. In simple terms, it is ‘naive’ in that it assumes the presence of one feature in a class is unrelated to the presence of any other feature. Despite this naive assumption, it is a powerful algorithm and is often used for supervised classification.
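To make the independence assumption concrete, here is a minimal sketch of how a Naive Bayes score can be computed: the class prior is multiplied by one per-feature probability at a time, with no interaction terms. The function name `naive_bayes_log_scores` and all probabilities below are made up for illustration; they are not taken from NLTK or from the dataset.

```python
import math


def naive_bayes_log_scores(features, priors, likelihoods):
    """Return log P(label) + sum of log P(feature | label) for each label."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        # Independence assumption: each feature contributes its own factor.
        for name, value in features.items():
            score += math.log(likelihoods[label][(name, value)])
        scores[label] = score
    return scores


# Hypothetical probabilities, chosen only for illustration.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {("contains(great)", True): 0.30, ("contains(sucks)", True): 0.01},
    "neg": {("contains(great)", True): 0.10, ("contains(sucks)", True): 0.14},
}

features = {"contains(great)": True, "contains(sucks)": True}
scores = naive_bayes_log_scores(features, priors, likelihoods)
predicted = max(scores, key=scores.get)
print(predicted)  # neg
```

Working in log space avoids numerical underflow when thousands of small probabilities are multiplied together.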
NLTK’s default Naive Bayes classifier stores features as dictionary entries, each mapping a feature name to a value. In this case, each feature corresponds to a word, and its value is a boolean that is true if the word occurs in the document and false otherwise. This gives us a Bag of Words model, in which the order and context of the words are ignored.
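A feature dictionary of this kind could be built as in the sketch below. The function name `document_features` and the toy vocabulary are assumptions for illustration; the `contains(word)` naming mirrors the feature names that appear in the output later in this post.

```python
def document_features(document_words, word_features):
    """Map every vocabulary word to True/False depending on whether it occurs in the document."""
    present = set(document_words)  # set membership ignores order and repetition
    return {f"contains({word})": (word in present) for word in word_features}


# Toy vocabulary and review, for illustration only.
word_features = ["great", "sucks", "plot"]
review = ["the", "plot", "sucks", "and", "the", "acting", "sucks"]

print(document_features(review, word_features))
# {'contains(great)': False, 'contains(sucks)': True, 'contains(plot)': True}
```

Note that "sucks" occurs twice in the toy review but the feature only records that it is present, which is exactly the Bag of Words simplification.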
The full code
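A minimal sketch of such a pipeline follows. It assumes the standard NLTK calls (`nltk.NaiveBayesClassifier.train`, `nltk.classify.accuracy`, `show_most_informative_features`) and the `movie_reviews` corpus. The helper names `document_features`, `build_featuresets`, `train_and_evaluate`, and `run_on_movie_reviews` are illustrative, and this is not necessarily the exact listing that produced the output in this post.

```python
import random
from collections import Counter

import nltk
from nltk.corpus import movie_reviews


def document_features(document_words, word_features):
    """Boolean bag-of-words features: is each vocabulary word present?"""
    present = set(document_words)
    return {f"contains({w})": (w in present) for w in word_features}


def build_featuresets(documents, vocab_size=3000):
    """Turn (word_list, label) pairs into (feature_dict, label) pairs."""
    counts = Counter(w for words, _ in documents for w in words)
    # Keep only the most frequent words as features.
    word_features = [w for w, _ in counts.most_common(vocab_size)]
    return [(document_features(words, word_features), label)
            for words, label in documents]


def train_and_evaluate(featuresets, train_fraction=0.9):
    """Train on the first part of the data and measure accuracy on the rest."""
    cut = int(len(featuresets) * train_fraction)
    train_set, test_set = featuresets[:cut], featuresets[cut:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)


def run_on_movie_reviews(seed=None):
    """Run the whole pipeline on the NLTK movie_reviews corpus."""
    nltk.download("movie_reviews", quiet=True)
    documents = [
        (list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)
    ]
    # Shuffle so that the train/test split mixes both classes.
    random.Random(seed).shuffle(documents)
    classifier, accuracy = train_and_evaluate(build_featuresets(documents))
    print("Final accuracy is:", accuracy)
    classifier.show_most_informative_features(10)
    return classifier, accuracy
```

Calling `run_on_movie_reviews()` downloads the corpus if needed and prints the accuracy and the ten most informative features. Passing a `seed` makes the shuffle reproducible.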
The output is shown below. The final accuracy is 0.835, which is not bad for a simple classifier with default settings. The 10 most informative features are also displayed. For example, “sucks” is about 14.2 times more likely to appear in negative reviews than in positive ones (neg : pos = 14.2 : 1.0), while “regard” is about 10.4 times more likely to appear in positive reviews than in negative ones (pos : neg = 10.4 : 1.0).
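The ratios can be read as likelihood ratios: how much more probable a feature is in one class than in the other. The sketch below computes such a ratio from raw document counts. The function name `likelihood_ratio` and the counts are hypothetical, and NLTK smooths its probability estimates by default, so its reported ratios will not exactly match a raw-count calculation like this one.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | class a) to P(feature | class b), from raw document counts."""
    return (count_a / total_a) / (count_b / total_b)


# Hypothetical counts, for illustration only: a word that occurs in
# 30 of 1000 negative reviews and in 3 of 1000 positive reviews.
ratio = likelihood_ratio(30, 1000, 3, 1000)
print(f"neg : pos = {ratio:.1f} : 1.0")  # neg : pos = 10.0 : 1.0
```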
    Number of docs in dataset is 2000
    Classes available: ['neg', 'pos']
    Docs in training dataset: 1800
    Docs in test dataset: 200
    Final accuracy is: 0.835
    Most Informative Features
        contains(sucks) = True            neg : pos    =     14.2 : 1.0
        contains(regard) = True           pos : neg    =     10.4 : 1.0
        contains(annual) = True           pos : neg    =      8.4 : 1.0
        contains(miscast) = True          neg : pos    =      7.7 : 1.0
        contains(groan) = True            neg : pos    =      7.6 : 1.0
        contains(silverstone) = True      neg : pos    =      7.6 : 1.0
        contains(unimaginative) = True    neg : pos    =      7.0 : 1.0
        contains(shoddy) = True           neg : pos    =      7.0 : 1.0
        contains(mena) = True             neg : pos    =      7.0 : 1.0
        contains(suvari) = True           neg : pos    =      7.0 : 1.0
    None
    ~~~~~-Execution time: 0.3960246165593465 Min~~~~~
Here we used NLTK’s default Naive Bayes classifier for sentiment classification of movie reviews. The model can certainly be improved, but this baseline is a good way to learn how sentiment classification works in the simplest way possible. Next time, we can use the Scikit-learn version of the Naive Bayes classifier with numerical feature values, rather than the simple boolean presence features used here.