In this post, we use Python's pickle module. With pickle, we can train any machine learning classifier, serialize the trained classifier to a .pickle file, and load it back later. This is especially useful when the classifier is trained on a large dataset, which can be time consuming. When using the classifier on live data where time matters, loading the trained classifier from a file rather than retraining it on every run saves time and gives more efficient processing.
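To make the general pattern concrete, here is a minimal sketch of the save-then-load workflow; the object and file name are placeholders for illustration, not part of the scripts below:

import pickle

# placeholder: any picklable Python object, e.g. a trained classifier
trained_model = {"example": "any picklable object"}

# serialize the trained object to disk
with open("model.pickle", "wb") as f:
    pickle.dump(trained_model, f)

# later (or in another script), deserialize it without retraining
with open("model.pickle", "rb") as f:
    restored_model = pickle.load(f)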
The following code takes 2,000 movie reviews, trains an NLTK Naive Bayes classifier on 90% of them, and tests the classification accuracy on the remaining 10%. The features are a simple bag-of-words model: the feature vector for each document is a dictionary mapping each candidate word to True if it appears in the document and False otherwise. For this run, we measure both the classification accuracy and the time taken for the entire process.
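To make the feature representation concrete, here is a toy sketch; the three feature words are made up for illustration and are not the real top-3,000 list used below:

word_features = ['great', 'boring', 'plot']  # made-up feature words for illustration

def get_features(doc):
    doc_words = set(doc)
    return {'contains(%s)' % w: (w in doc_words) for w in word_features}

print(get_features(['a', 'great', 'plot']))
# -> {'contains(great)': True, 'contains(boring)': False, 'contains(plot)': True}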
from nltk.corpus import movie_reviews  # movie reviews dataset (1k pos, 1k neg)
import nltk
import random
import time

def main():
    # get each doc and its corresponding class ('pos' or 'neg')
    docs = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

    # check number of docs in dataset (should be 2000)
    print("Number of docs in dataset is {}".format(len(movie_reviews.fileids())))

    # check classes for docs in dataset (should be 'pos' and 'neg')
    print("Classes available: {}".format(movie_reviews.categories()))

    # mix up the docs
    random.shuffle(docs)

    # frequency distribution of all words, lowercased
    words_in_dataset = nltk.FreqDist(word.lower() for word in movie_reviews.words())

    # get the words to use as features (top 3000 most frequent words only)
    word_features = list(words_in_dataset)[:3000]

    # for each word feature, check whether it appears in the doc,
    # and record a boolean value for included or not included
    def get_features(doc):
        doc_words = set(doc)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in doc_words)
        return features

    # build the feature set for each doc in each class
    featuresets = [(get_features(_doc), _class) for (_doc, _class) in docs]

    # give train set 1800 docs (90%) and test set 200 docs (10%)
    train_set = featuresets[:1800]
    test_set = featuresets[1800:]
    print("Docs in training dataset: ", len(train_set))
    print("Docs in test dataset: ", len(test_set))

    # give train set to NB classifier
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # print accuracy
    print("Final accuracy is: ", end=" ")
    print(nltk.classify.accuracy(classifier, test_set))

if __name__ == '__main__':
    startExecutionTime = time.time()
    main()
    # get execution time
    print("~~~~~-Execution time: %s Min~~~~~" % ((time.time() - startExecutionTime) / 60))
As shown in the output below, the entire training and classification process takes about 0.75 minutes, and the accuracy is 0.79.
Number of docs in dataset is 2000
Classes available: ['neg', 'pos']
Docs in training dataset: 1800
Docs in test dataset: 200
Final accuracy is: 0.79
~~~~~-Execution time: 0.7543748577435812 Min~~~~~
This next script is the same as the previous one, except that here we import the pickle module and use it to save the trained classifier as saved_trained_classifier.pickle. I include the full code to avoid any confusion. After running it, you should see the newly created file in the working directory from which your script was run.
from nltk.corpus import movie_reviews  # movie reviews dataset (1k pos, 1k neg)
import nltk
import random
import time
import pickle

def main():
    # get each doc and its corresponding class ('pos' or 'neg')
    docs = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

    # check number of docs in dataset (should be 2000)
    print("Number of docs in dataset is {}".format(len(movie_reviews.fileids())))

    # check classes for docs in dataset (should be 'pos' and 'neg')
    print("Classes available: {}".format(movie_reviews.categories()))

    # mix up the docs
    random.shuffle(docs)

    # frequency distribution of all words, lowercased
    words_in_dataset = nltk.FreqDist(word.lower() for word in movie_reviews.words())

    # get the words to use as features (top 3000 most frequent words only)
    word_features = list(words_in_dataset)[:3000]

    # for each word feature, check whether it appears in the doc,
    # and record a boolean value for included or not included
    def get_features(doc):
        doc_words = set(doc)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in doc_words)
        return features

    # build the feature set for each doc in each class
    featuresets = [(get_features(_doc), _class) for (_doc, _class) in docs]

    # give train set 1800 docs (90%) and test set 200 docs (10%)
    train_set = featuresets[:1800]
    test_set = featuresets[1800:]
    print("Docs in training dataset: ", len(train_set))
    print("Docs in test dataset: ", len(test_set))

    # give train set to NB classifier
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # save the trained classifier to file
    save_classifier = open("saved_trained_classifier.pickle", "wb")
    pickle.dump(classifier, save_classifier)
    save_classifier.close()

    # print accuracy
    print("Final accuracy is: ", end=" ")
    print(nltk.classify.accuracy(classifier, test_set))

if __name__ == '__main__':
    startExecutionTime = time.time()
    main()
    # get execution time
    print("~~~~~-Execution time: %s Min~~~~~" % ((time.time() - startExecutionTime) / 60))
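As an aside, the open/dump/close sequence above can equivalently be written with a context manager, which guarantees the file is closed even if pickling raises an error. This is an optional stylistic alternative, not a change to the method:

with open("saved_trained_classifier.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)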
Finally, we run the same code again, but this time, instead of training the classifier during execution, we load saved_trained_classifier.pickle and use the saved classifier to compute the accuracy.
from nltk.corpus import movie_reviews  # movie reviews dataset (1k pos, 1k neg)
import nltk
import random
import time
import pickle

def main():
    # get each doc and its corresponding class ('pos' or 'neg')
    docs = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

    # check number of docs in dataset (should be 2000)
    print("Number of docs in dataset is {}".format(len(movie_reviews.fileids())))

    # check classes for docs in dataset (should be 'pos' and 'neg')
    print("Classes available: {}".format(movie_reviews.categories()))

    # mix up the docs
    random.shuffle(docs)

    # frequency distribution of all words, lowercased
    words_in_dataset = nltk.FreqDist(word.lower() for word in movie_reviews.words())

    # get the words to use as features (top 3000 most frequent words only)
    word_features = list(words_in_dataset)[:3000]

    # for each word feature, check whether it appears in the doc,
    # and record a boolean value for included or not included
    def get_features(doc):
        doc_words = set(doc)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in doc_words)
        return features

    # build the feature set for each doc in each class
    featuresets = [(get_features(_doc), _class) for (_doc, _class) in docs]

    # give test set 200 docs (10%); the train set is no longer needed
    #train_set = featuresets[:1800]
    test_set = featuresets[1800:]
    #print("Docs in training dataset: ", len(train_set))
    print("Docs in test dataset: ", len(test_set))

    # no training this time
    #classifier = nltk.NaiveBayesClassifier.train(train_set)

    # no saving this time either
    #save_classifier = open("saved_trained_classifier.pickle", "wb")
    #pickle.dump(classifier, save_classifier)
    #save_classifier.close()

    # load the saved trained classifier from file instead of retraining
    classifier_file = open("saved_trained_classifier.pickle", "rb")
    classifier = pickle.load(classifier_file)
    classifier_file.close()

    # print accuracy
    print("Final accuracy is: ", end=" ")
    print(nltk.classify.accuracy(classifier, test_set))

if __name__ == '__main__':
    startExecutionTime = time.time()
    main()
    # get execution time
    print("~~~~~-Execution time: %s Min~~~~~" % ((time.time() - startExecutionTime) / 60))
As shown in the output below, using the saved trained classifier cuts the execution time to roughly a third (from 0.75 min in the previous run to 0.24 min).
Number of docs in dataset is 2000
Classes available: ['neg', 'pos']
Docs in test dataset: 200
Final accuracy is: 0.9
~~~~~-Execution time: 0.24980648358662924 Min~~~~~
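One caveat about the comparison above: because random.shuffle runs again in the second script without a fixed seed, the 200-doc test split differs between runs, and some of those docs were likely in the saved classifier's training data, which can inflate the 0.9 accuracy. A simple way to keep the evaluation honest is to fix the shuffle seed in both scripts. Here is a self-contained sketch; the seed value 42 is arbitrary, and the integer list stands in for the real list of (words, category) pairs:

import random

random.seed(42)            # same arbitrary seed in both scripts
docs = list(range(2000))   # stand-in for the real docs list
random.shuffle(docs)       # both scripts now produce the identical order/split

Alternatively, you could pickle test_set alongside the classifier in the first script and reload both in the second.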
That’s it! Now you can try this out on your own dataset and see how much time you save. 🙂