Constructing an Ensemble of Supervised Classifiers for Sentiment Analysis

What is an Ensemble of Classifiers in Supervised Classification?

In machine learning, an ensemble of supervised classifiers is a group, or ‘committee’, of classifiers that each classify an input and vote on its label. The final label for the input is the one that receives the most votes. In sentiment analysis, for example, if seven out of eight classifiers label a document as positive, the document is classified as positive, since that is the most common label across the committee. A simple confidence score for the decision is the number of classifiers that agree on the most common label, divided by the total number of classifiers. In this case, the confidence that the document is positive is 7/8 = 0.875, which is high. Now suppose only four of the eight classifiers label the document as positive and the remaining four label it as negative. The confidence would then be only 0.5: the vote is tied, so the label is unreliable.
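As a minimal sketch of the arithmetic above, the winning label and its confidence can be computed from a plain list of votes. The function name vote_confidence and the label strings are illustrative assumptions, not part of NLTK.

```python
from collections import Counter

def vote_confidence(votes):
    """Return (winning_label, share_of_votes) for a list of label votes."""
    counts = Counter(votes)
    # most_common(1) gives the label with the highest count; on a tie,
    # the label that was encountered first wins
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)

print(vote_confidence(["pos"] * 7 + ["neg"]))      # ('pos', 0.875)
print(vote_confidence(["pos"] * 4 + ["neg"] * 4))  # ('pos', 0.5)
```

In the tied case the returned label depends only on vote order, which is one reason an odd number of classifiers is convenient for a two-class problem.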

 

ClassifierI Class in NLTK

By subclassing the ClassifierI class in NLTK, we can combine a variety of different classifiers into an ensemble, where the final label and confidence for a document are determined by the vote across the entire committee, with each classifier getting an equal say in the decision. This gives an estimate of the confidence in each classification, which may be a useful indicator of how well the classification model is performing.
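The idea can be sketched without NLTK installed. The classes below are hypothetical stand-ins: VotingSketch plays the role of the ensemble (in the article's code it subclasses ClassifierI), and ConstantClassifier is a dummy member that only exists to make the example runnable. The only assumption about a member is that it exposes a classify(features) method.

```python
from collections import Counter

class VotingSketch:
    """Plain-Python stand-in for an equal-weight voting ensemble."""

    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify_with_conf(self, features):
        # ask every member exactly once, then reuse the votes
        # for both the label and the confidence
        votes = [c.classify(features) for c in self._classifiers]
        label, n_votes = Counter(votes).most_common(1)[0]
        return label, n_votes / len(votes)

class ConstantClassifier:
    """Hypothetical member that always returns the same label."""

    def __init__(self, label):
        self._label = label

    def classify(self, features):
        return self._label

committee = VotingSketch(ConstantClassifier("pos"),
                         ConstantClassifier("pos"),
                         ConstantClassifier("neg"))
label, conf = committee.classify_with_conf({"great": True})
print(label, round(conf, 3))  # pos 0.667
```

Counting with Counter rather than statistics.mode is a design choice for this sketch: on Python versions before 3.8, statistics.mode raises StatisticsError when two labels tie, which can happen with an even number of members.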

 

The full code

The following code takes the first 5,000 positive and 5,000 negative documents from the Stanford Large Movie Review dataset. Each classifier is trained on 90% of the data (9,000 documents) and tested on the remaining 10% (1,000 documents). We also save all of the trained classifiers in .pickle format so that we do not have to retrain them on every run. For preprocessing, we perform punctuation and stop word removal, lowercase conversion, and tokenization for each document. We compute and display the accuracy of each classifier, and then of the ensemble, which outputs the class that most classifiers agree on. Finally, the confidence value is shown for the first three documents in the testing set. The confidence (conf) value is simply the proportion of classifiers that agree on the winning class.

import nltk
import time
import random
import re
import glob
import pickle
from nltk.corpus import stopwords
# import classifiers
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI
from statistics import mode

# constructing the Ensemble class
class Ensemble(ClassifierI):
    
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    
    def classify(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)
        return mode(votes)
    
    def conf(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)        
        
        vote_counts = votes.count(mode(votes))
        conf = vote_counts / len(votes)
        return conf


def main():
    
    # load dataset docs
    pos_docs = glob.glob('test\\pos\\*.txt') # list the positive review files (Windows-style path)
    neg_docs = glob.glob('test\\neg\\*.txt') # list the negative review files
    pos_dataset = []
    neg_dataset = []
    
    # read pos docs one by one
    for doc in pos_docs[:5000]:
        f = open(doc, encoding="utf8")
        doc = f.read()
        pos_dataset.append(doc)
        f.close()
    
    # read neg docs one by one
    for doc in neg_docs[:5000]:
        f = open(doc, encoding="utf8")
        doc = f.read()
        neg_dataset.append(doc)
        f.close()
    
    all_words = []
    docs = []
    
    #define stopwords
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    
    ### preprocessing of pos docs: 
    for doc in pos_dataset:
        
        # generate tuples of review - label pairs
        docs.append((doc, "pos"))
        # punctuation removal 
        cleaned_doc = re.sub(r'[^a-zA-Z\s]', '', doc)
        # tokenization to create a BOW model
        tokenized_doc = word_tokenize(cleaned_doc)
        # stopwords removal (compared in lowercase so capitalised stop words are caught too)
        final_doc = [w for w in tokenized_doc if w.lower() not in stop_words]
        # conversion to lowercase
        for word in final_doc:
            all_words.append(word.lower())
    
    ### preprocessing of neg docs
    for doc in neg_dataset:
        
        # generate tuples of review - label pairs 
        docs.append((doc, "neg"))
        # punctuation removal
        cleaned_doc = re.sub(r'[^a-zA-Z\s]', '', doc)
        # tokenization to create BOW model
        tokenized_doc = word_tokenize(cleaned_doc)
        # stopwords removal (compared in lowercase so capitalised stop words are caught too)
        final_doc = [w for w in tokenized_doc if w.lower() not in stop_words]
        # conversion to lowercase
        for w in final_doc:
            all_words.append(w.lower())           
    
    # taking words and generating a frequency distribution
    all_words = nltk.FreqDist(all_words)
    
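    # using the 3,000 most frequent words as the features vector for each review
    # (keys() is not ordered by frequency in NLTK 3, so most_common() is used instead)
    word_features = [word for word, count in all_words.most_common(3000)]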
    
    # saving the most frequent words as a pickle 
    pickle_jar = open("pickle_jar/word_feats.pickle", "wb")
    pickle.dump(word_features, pickle_jar)
    pickle_jar.close()
    
    # generating a dictionary of features 
    #(keys are words and values are boolean values as True if word exists in doc)
    def find_features(document):
        words = set(word_tokenize(document))  # set: fast membership tests below
        features = {}
        for word in word_features:
            features[word] = (word in words)
        return features
    
    # creating a features vector for each review
    featuresets = [(find_features(review), category) for (review, category) in docs]
    
    # always good to shuffle the docs
    random.shuffle(featuresets)
    
    #defining the train-test sets
    training_set = featuresets[:9000]
    testing_set = featuresets[9000:]
    
    # performing classification using various classifiers, and an ensemble
    print("Using classifiers using {} training samples and {} testing samples...\n".format(len(training_set), len(testing_set)))
       
    # multinomial naive bayes classifier
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(training_set)
    print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/MNB9k.pickle", 'wb')
    pickle.dump(MNB_classifier, pickle_jar)
    pickle_jar.close()
    
    # Bernoulli naive bayes classifier
    BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
    BernoulliNB_classifier.train(training_set)
    print("BernoulliNB_classifier accuracy:",(nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/Bernoulli9k.pickle", 'wb')
    pickle.dump(BernoulliNB_classifier, pickle_jar)
    pickle_jar.close()
    
    # Logistic regression classifier
    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
    LogisticRegression_classifier.train(training_set)
    print("LogisticRegression_classifier accuracy:",(nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/LogRes9k.pickle", 'wb')
    pickle.dump(LogisticRegression_classifier, pickle_jar)
    pickle_jar.close()
    
    # SGD classifier
    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(training_set)
    print("SGDClassifier_classifier accuracy:",(nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/SGD9k.pickle", 'wb')
    pickle.dump(SGDClassifier_classifier, pickle_jar)
    pickle_jar.close()
    
    # Support vector classifier
    SVC_classifier = SklearnClassifier(SVC())
    SVC_classifier.train(training_set)
    print("SVC_classifier accuracy:",(nltk.classify.accuracy(SVC_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/SVC9k.pickle", 'wb')
    pickle.dump(SVC_classifier, pickle_jar)
    pickle_jar.close()
    
    # Linear version of support vector classifier
    LinearSVC_classifier = SklearnClassifier(LinearSVC())
    LinearSVC_classifier.train(training_set)
    print("LinearSVC_classifier accuracy:",(nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/LinearSVC9k.pickle", 'wb')
    pickle.dump(LinearSVC_classifier, pickle_jar)
    pickle_jar.close()
    
    # Nu support vector classifier 
    NuSVC_classifier = SklearnClassifier(NuSVC())
    NuSVC_classifier.train(training_set)
    print("NuSVC_classifier accuracy:",(nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
    pickle_jar = open("pickle_jar/NuSVC9k.pickle", 'wb')
    pickle.dump(NuSVC_classifier, pickle_jar)
    pickle_jar.close()

    # creating an ensemble using all of the classifiers available
    ensemble = Ensemble(MNB_classifier, 
                        BernoulliNB_classifier, 
                        LogisticRegression_classifier, 
                        SGDClassifier_classifier, 
                        SVC_classifier, 
                        LinearSVC_classifier, 
                        NuSVC_classifier)
    print("Ensemble Classifier accuracy:", (nltk.classify.accuracy(ensemble, testing_set))*100)
    # for the first 3 docs, displays the confidence (portion of classifiers in agreement on the polarity label / all classifiers)
    print("Classification", ensemble.classify(testing_set[0][0]), "confidence %", ensemble.conf(testing_set[0][0])*100)
    print("Classification", ensemble.classify(testing_set[1][0]), "confidence %", ensemble.conf(testing_set[1][0])*100)
    print("Classification", ensemble.classify(testing_set[2][0]), "confidence %", ensemble.conf(testing_set[2][0])*100)

if __name__ == '__main__':
    
    startExecutionTime = time.time()
   
    main()
    
    # get execution time 
    print("~~~~~-Execution time: %s Min~~~~~" % ((time.time() - startExecutionTime)/60))
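The preprocessing steps inside main() can be tried in isolation. The sketch below is an approximation under stated assumptions: it uses str.split instead of word_tokenize, it lowercases before the stop word check, and it relies on a tiny hypothetical stop word set (SMALL_STOP_WORDS) instead of NLTK's English list, so that it runs without downloading any NLTK data.

```python
import re

# hypothetical, deliberately tiny stop word list
# (NLTK's English list is much longer)
SMALL_STOP_WORDS = {"the", "was", "a", "and", "it", "is"}

def preprocess(doc):
    """Strip non-letters, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", doc)
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in SMALL_STOP_WORDS]

print(preprocess("The movie was GREAT, and it's (really) fun!"))
# ['movie', 'great', 'its', 'really', 'fun']
```

Note that stripping punctuation turns "it's" into "its", which is why it survives the stop word filter here.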

 

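Now that we have saved all of the trained classifiers to file, the directory called “pickle_jar” should contain the files listed below. The directory must already exist in your current working directory before the script runs, since open() does not create it.

word_feats.pickle
MNB9k.pickle
Bernoulli9k.pickle
LogRes9k.pickle
SGD9k.pickle
SVC9k.pickle
LinearSVC9k.pickle
NuSVC9k.pickle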
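The repeated open/dump/close pattern in the script could be wrapped in two small helpers. This is a sketch, not the article's code: save_pickle and load_pickle are hypothetical names, and the demo writes to a temporary directory rather than to the real pickle_jar.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize obj to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:  # the file is closed even if dump() fails
        pickle.dump(obj, f)

def load_pickle(path):
    """Load and return the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# round-trip demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "pickle_jar", "word_feats.pickle")
    save_pickle(["good", "bad"], demo_path)
    restored = load_pickle(demo_path)

print(restored)  # ['good', 'bad']
```

Pickle files can execute arbitrary code when loaded, so only load pickles that you created yourself or that come from a source you trust.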

 

The following code runs the entire model again, but loads the trained classifier pickles instead of retraining them. Retraining would be lengthy if you plan to use a large training set.

import nltk
import time
import random
import re
import glob
import pickle
from nltk.corpus import stopwords
# import classifiers
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI
from statistics import mode

# constructing the Ensemble class
class Ensemble(ClassifierI):
    
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    
    def classify(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)
        return mode(votes)
    
    def conf(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)        
        
        vote_counts = votes.count(mode(votes))
        conf = vote_counts / len(votes)
        return conf


def main():
    
    # load dataset docs
    pos_docs = glob.glob('test\\pos\\*.txt') # list the positive review files (Windows-style path)
    neg_docs = glob.glob('test\\neg\\*.txt') # list the negative review files
    pos_dataset = []
    neg_dataset = []
    
    # read pos docs one by one
    for doc in pos_docs[:5000]:
        f = open(doc, encoding="utf8")
        doc = f.read()
        pos_dataset.append(doc)
        f.close()
    
    # read neg docs one by one
    for doc in neg_docs[:5000]:
        f = open(doc, encoding="utf8")
        doc = f.read()
        neg_dataset.append(doc)
        f.close()
    
    all_words = []
    docs = []
    
    #define stopwords
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    
    ### preprocessing of pos docs: 
    for doc in pos_dataset:
        
        # generate tuples of review - label pairs
        docs.append((doc, "pos"))
        # punctuation removal 
        cleaned_doc = re.sub(r'[^a-zA-Z\s]', '', doc)
        # tokenization to create a BOW model
        tokenized_doc = word_tokenize(cleaned_doc)
        # stopwords removal (compared in lowercase so capitalised stop words are caught too)
        final_doc = [w for w in tokenized_doc if w.lower() not in stop_words]
        # conversion to lowercase
        for word in final_doc:
            all_words.append(word.lower())
    
    ### preprocessing of neg docs
    for doc in neg_dataset:
        
        # generate tuples of review - label pairs 
        docs.append((doc, "neg"))
        # punctuation removal
        cleaned_doc = re.sub(r'[^a-zA-Z\s]', '', doc)
        # tokenization to create BOW model
        tokenized_doc = word_tokenize(cleaned_doc)
        # stopwords removal (compared in lowercase so capitalised stop words are caught too)
        final_doc = [w for w in tokenized_doc if w.lower() not in stop_words]
        # conversion to lowercase
        for w in final_doc:
            all_words.append(w.lower())           
    
    # taking words and generating a frequency distribution
    #all_words = nltk.FreqDist(all_words)
    
    # using the most frequent words as the features vector for each review
    #word_features = list(all_words.keys())[:3000]
    
    # saving the most frequent words as a pickle 
#    pickle_jar = open("pickle_jar/word_feats.pickle", "wb")
#    pickle.dump(word_features, pickle_jar)
#    pickle_jar.close()
    
    # opening frequent words pickle from file
    f = open("pickle_jar/word_feats.pickle", "rb")
    word_features = pickle.load(f)
    f.close()
    
    # generating a dictionary of features 
    #(keys are words and values are boolean values as True if word exists in doc)
    def find_features(document):
        words = set(word_tokenize(document))  # set: fast membership tests below
        features = {}
        for word in word_features:
            features[word] = (word in words)
        return features
    
    # creating a features vector for each review
    featuresets = [(find_features(review), category) for (review, category) in docs]
    
    # always good to shuffle the docs
    random.shuffle(featuresets)
    
    #defining the train-test sets
    training_set = featuresets[:9000]
    testing_set = featuresets[9000:]
    
    # performing classification using various classifiers, and an ensemble
    print("Using classifiers using {} training samples and {} testing samples...\n".format(len(training_set), len(testing_set)))
       
    # multinomial naive bayes classifier
#    MNB_classifier = SklearnClassifier(MultinomialNB())
#    MNB_classifier.train(training_set)
#    print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)
#    pickle_jar = open("pickle_jar/MNB9k.pickle", 'wb')
#    pickle.dump(MNB_classifier, pickle_jar)
#    pickle_jar.close()
    
    # opening classifier pickles from file, and performing classification
    f = open("pickle_jar/MNB9k.pickle", "rb")
    MNB_classifier = pickle.load(f)
    f.close()
    print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)
    
    # Bernoulli naive bayes classifier
#    BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
#    BernoulliNB_classifier.train(training_set)
#    print("BernoulliNB_classifier accuracy:",(nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
#    pickle_jar = open("pickle_jar/Bernoulli9k.pickle", 'wb')
#    pickle.dump(BernoulliNB_classifier, pickle_jar)
#    pickle_jar.close()
    f = open("pickle_jar/Bernoulli9k.pickle", "rb")
    BernoulliNB_classifier = pickle.load(f)
    f.close()
    print("BernoulliNB_classifier accuracy:",(nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
    
    # Logistic regression classifier
#    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
#    LogisticRegression_classifier.train(training_set)
#    print("LogisticRegression_classifier accuracy:",(nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
#
#    pickle_jar = open("pickle_jar/LogRes9k.pickle", 'wb')
#    pickle.dump(LogisticRegression_classifier, pickle_jar)
#    pickle_jar.close()
    
    f = open("pickle_jar/LogRes9k.pickle", "rb")
    LogisticRegression_classifier = pickle.load(f)
    f.close()    
    print("LogisticRegression_classifier accuracy:",(nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
    # SGD classifier
#    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
#    SGDClassifier_classifier.train(training_set)
#    print("SGDClassifier_classifier accuracy:",(nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
#
#    pickle_jar = open("pickle_jar/SGD9k.pickle", 'wb')
#    pickle.dump(SGDClassifier_classifier, pickle_jar)
#    pickle_jar.close()
    f = open("pickle_jar/SGD9k.pickle", "rb")
    SGDClassifier_classifier = pickle.load(f)
    f.close()    
    print("SGDClassifier_classifier accuracy:",(nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
    
    # Support vector classifier
#    SVC_classifier = SklearnClassifier(SVC())
#    SVC_classifier.train(training_set)
#    print("SVC_classifier accuracy:",(nltk.classify.accuracy(SVC_classifier, testing_set))*100)
#
#    pickle_jar = open("pickle_jar/SVC9k.pickle", 'wb')
#    pickle.dump(SVC_classifier, pickle_jar)
#    pickle_jar.close()
    f = open("pickle_jar/SVC9k.pickle", "rb")
    SVC_classifier = pickle.load(f)
    f.close()    
    print("SVC_classifier accuracy:",(nltk.classify.accuracy(SVC_classifier, testing_set))*100)
    # Linear version of support vector classifier
#    LinearSVC_classifier = SklearnClassifier(LinearSVC())
#    LinearSVC_classifier.train(training_set)
#    print("LinearSVC_classifier accuracy:",(nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
#
#    pickle_jar = open("pickle_jar/LinearSVC9k.pickle", 'wb')
#    pickle.dump(LinearSVC_classifier, pickle_jar)
#    pickle_jar.close()
    f = open("pickle_jar/LinearSVC9k.pickle", "rb")
    LinearSVC_classifier = pickle.load(f)
    f.close()   
    print("LinearSVC_classifier accuracy:",(nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
    # Nu support vector classifier 
#    NuSVC_classifier = SklearnClassifier(NuSVC())
#    NuSVC_classifier.train(training_set)
#    print("NuSVC_classifier accuracy:",(nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
#
#    pickle_jar = open("pickle_jar/NuSVC9k.pickle", 'wb')
#    pickle.dump(NuSVC_classifier, pickle_jar)
#    pickle_jar.close()
    f = open("pickle_jar/NuSVC9k.pickle", "rb")
    NuSVC_classifier = pickle.load(f)
    f.close()   
    print("NuSVC_classifier accuracy:",(nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
    # creating an ensemble using all of the classifiers available
    ensemble = Ensemble(MNB_classifier, 
                        BernoulliNB_classifier, 
                        LogisticRegression_classifier, 
                        SGDClassifier_classifier, 
                        SVC_classifier, 
                        LinearSVC_classifier, 
                        NuSVC_classifier)
    print("Ensemble Classifier accuracy:", (nltk.classify.accuracy(ensemble, testing_set))*100)
    # for the first 3 docs, displays the confidence (portion of classifiers in agreement on the polarity label / all classifiers)
    print("Classification", ensemble.classify(testing_set[0][0]), "confidence %", ensemble.conf(testing_set[0][0])*100)
    print("Classification", ensemble.classify(testing_set[1][0]), "confidence %", ensemble.conf(testing_set[1][0])*100)
    print("Classification", ensemble.classify(testing_set[2][0]), "confidence %", ensemble.conf(testing_set[2][0])*100)

if __name__ == '__main__':
    
    startExecutionTime = time.time()
   
    main()
    
    # get execution time 
    print("~~~~~-Execution time: %s Min~~~~~" % ((time.time() - startExecutionTime)/60))

 

As shown in the output, we can see the accuracy achieved on the 1,000-document test set by each of the individual classifiers, and then at the end by the ensemble classifier. We also show the confidence in the classification of the first three documents in the test set. This gives an idea of how the confidence changes according to the proportion of classifiers that agree on a particular label.

As shown below, the ensemble classification accuracy is about 90%, which is not bad at all. Keep in mind that we train on 90% of the data and test on only 10%, and that the script above reshuffles the documents before loading the pickled classifiers, so many of its test documents were likely seen during training and the figures are probably optimistic. Accuracy may also be lower on noisier or less reliable text streams. The confidence for the first two reviews is 100%, which means all of the classifiers voted for the same label. The third review has a confidence of about 85.7% (six of the seven classifiers), showing that most, but not all, of the classifiers agreed on the label. It can still be considered fairly reliable.

Using classifiers using 9000 training samples and 1000 testing samples...
MNB_classifier accuracy percent: 84.8
BernoulliNB_classifier accuracy: 84.2
LogisticRegression_classifier accuracy: 90.9
SGDClassifier_classifier accuracy: 85.9
SVC_classifier accuracy: 80.80000000000001
LinearSVC_classifier accuracy: 92.2
NuSVC_classifier accuracy: 87.3
Ensemble Classifier accuracy: 90.0
Classification neg confidence % 100.0
Classification neg confidence % 100.0
Classification neg confidence % 85.71428571428571
~~~~~-Execution time: 5.270024605592092 Min~~~~~
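One way to keep the test documents identical between the training run and the pickle-loading run is to shuffle with a fixed seed. This is a sketch rather than what the scripts above do; split_with_seed and the seed value 42 are illustrative assumptions, and the integers stand in for (features, label) pairs.

```python
import random

def split_with_seed(items, n_train, seed=42):
    """Shuffle a copy of items with a fixed seed and split it into train/test."""
    shuffled = list(items)
    # a private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

docs = list(range(10))  # stand-in for the 10,000 feature sets
train_a, test_a = split_with_seed(docs, 9)
train_b, test_b = split_with_seed(docs, 9)

print(test_a == test_b)             # True
print(set(train_a) & set(test_a))   # set()
```

With the same seed and the same input order, both runs evaluate on exactly the documents that were held out during training.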

 

Now that we have trained all of the classifiers and saved them as pickles, we no longer need certain parts of the code that are involved in training the classifiers, since we simply use the saved pickles. We can now create the final code as a usable module. This module uses the ensemble and the saved trained classifiers that all work as members of this ensemble. We have created a simple function at the end to take any text, classify it as positive or negative based on the available training data, and finally, compute the confidence of the classification decision. We have saved this module as sentiment_analyzer.py.

import pickle
from nltk.tokenize import word_tokenize
from nltk.classify import ClassifierI
from statistics import mode

# constructing the Ensemble class
class Ensemble(ClassifierI):
    
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    
    def classify(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)
        return mode(votes)
    
    def conf(self, features):
        votes = []
        for classifier in self._classifiers:
            vote = classifier.classify(features)
            votes.append(vote)        
        
        vote_counts = votes.count(mode(votes))
        conf = vote_counts / len(votes)
        return conf


# opening frequent words pickle from file
f = open("pickle_jar/word_feats.pickle", "rb")
word_features = pickle.load(f)
f.close()
# generating a dictionary of features 
#(keys are words and values are boolean values as True if word exists in doc)
def find_features(document):
    words = set(word_tokenize(document))  # set: fast membership tests below
    features = {}
    for word in word_features:
        features[word] = (word in words)
    return features
# opening classifier pickles from file, and performing classification
f = open("pickle_jar/MNB9k.pickle", "rb")
MNB_classifier = pickle.load(f)
f.close()
#print("MNB_classifier accuracy percent:",(nltk.classify.accuracy(MNB_classifier, testing_set))*100)
f = open("pickle_jar/Bernoulli9k.pickle", "rb")
BernoulliNB_classifier = pickle.load(f)
f.close()
#print("BernoulliNB_classifier accuracy:",(nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)
f = open("pickle_jar/LogRes9k.pickle", "rb")
LogisticRegression_classifier = pickle.load(f)
f.close()    
#print("LogisticRegression_classifier accuracy:",(nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)
f = open("pickle_jar/SGD9k.pickle", "rb")
SGDClassifier_classifier = pickle.load(f)
f.close()    
#print("SGDClassifier_classifier accuracy:",(nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)
f = open("pickle_jar/SVC9k.pickle", "rb")
SVC_classifier = pickle.load(f)
f.close()    
#print("SVC_classifier accuracy:",(nltk.classify.accuracy(SVC_classifier, testing_set))*100)
f = open("pickle_jar/LinearSVC9k.pickle", "rb")
LinearSVC_classifier = pickle.load(f)
f.close()   
#print("LinearSVC_classifier accuracy:",(nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)
f = open("pickle_jar/NuSVC9k.pickle", "rb")
NuSVC_classifier = pickle.load(f)
f.close()   
#print("NuSVC_classifier accuracy:",(nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)
# creating an ensemble using all of the classifiers available
ensemble = Ensemble(MNB_classifier, 
                    BernoulliNB_classifier, 
                    LogisticRegression_classifier, 
                    SGDClassifier_classifier, 
                    SVC_classifier, 
                    LinearSVC_classifier, 
                    NuSVC_classifier)

def get_sentiment(input_text):
    features = find_features(input_text)
    return ensemble.classify(features), ensemble.conf(features)
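To make the feature representation used by find_features concrete, here is a small self-contained sketch of the two steps involved: picking a vocabulary of frequent words, then mapping a document to presence/absence flags. The helper names top_words and presence_features and the toy tokens are assumptions for illustration, and collections.Counter stands in for nltk.FreqDist.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

def presence_features(doc_tokens, vocabulary):
    """Map each vocabulary word to True/False: does the document contain it?"""
    present = set(doc_tokens)
    return {word: (word in present) for word in vocabulary}

corpus_tokens = ["good", "bad", "good", "plot", "good", "bad"]
vocab = top_words(corpus_tokens, 2)
print(vocab)                                        # ['good', 'bad']
print(presence_features(["bad", "acting"], vocab))  # {'good': False, 'bad': True}
```

Words outside the vocabulary, such as "acting" here, are simply ignored, so every document is described by the same fixed set of keys.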
        

 

Now that we have the sentiment_analyzer.py module ready to go, we can import it and use it as a real-world sentiment classification application on other datasets or for real-time, live sentiment analysis. The code below imports this module and uses it to classify some example reviews we have created.

import sentiment_analyzer as s
print(s.get_sentiment("This movie was great. The sound quality was amazing."))
print(s.get_sentiment("Terrible movie. Acting was bad. Hate it!"))
print(s.get_sentiment("Was not so good. Just read the book!"))
print(s.get_sentiment("Was not worth the time!"))

 

As shown in the output below, our model classified each of the short reviews with the correct polarity, using an ensemble of classifiers available in the Scikit-learn library. Confidence was high for the first three reviews and lower for the last one (about 0.71, or five of the seven classifiers in agreement).

('pos', 1.0)
('neg', 1.0)
('neg', 0.8571428571428571)
('neg', 0.7142857142857143)
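Since every prediction comes with a confidence, one possible use is to set aside low-agreement results instead of trusting them. The sketch below assumes a hypothetical helper, label_or_uncertain, and an arbitrary threshold of 0.75; it works on (label, confidence) tuples in the shape shown above rather than calling the module itself.

```python
def label_or_uncertain(prediction, min_conf=0.75):
    """Keep the label only when the committee's agreement reaches min_conf."""
    label, conf = prediction
    return label if conf >= min_conf else "uncertain"

# (label, confidence) tuples mirroring the four outputs above
outputs = [("pos", 1.0), ("neg", 1.0), ("neg", 6 / 7), ("neg", 5 / 7)]
print([label_or_uncertain(p) for p in outputs])
# ['pos', 'neg', 'neg', 'uncertain']
```

The right threshold depends on the application; with seven voters the only possible confidence values are 4/7, 5/7, 6/7 and 1.0.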

 

That’s it! Feel free to comment on how we can further improve this module for use on real-world data! 🙂

 


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.