In this post, we will briefly highlight and demonstrate the difference between using wordcounts and tf.idf values to create features vectors from text. As in any text classification task, before we can process text, we need to convert it to a numerical representation that the computer can understand. The simplest form is simply taking word counts, and creating features vectors as a representation of the documents in the dataset. Each row is a features vector that represents the document, and the column values are a numerical value that shows how many times the term appears in the document. The column spans the entire vocabulary in the dataset.
The code below shows a brief example of how the features vectors of a small dataset are generated using both word counts (using the CountVectorizer() function), and then tf.idf weights (using the TfidfVectorizer() function).
from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfVectorizer from pandas import DataFrame def get_term_matrix(docs, vectorizer): term_matrix = vectorizer.fit_transform(docs) return DataFrame(term_matrix.toarray(), columns = vectorizer.get_feature_names()) # create a small dataset of docs docs = ["I love working with Python", "Python is a good for working with text"] # create a term count vectorizer (BOW) term_count_vectorizer = CountVectorizer() # create a tfidf weights vectorizer tfidf_vectorizer = TfidfVectorizer() print("Matrix generated using word counts as features..") print(get_term_matrix(docs, term_count_vectorizer)) print("\nMatrix generated using tfidf weights as features..") print(get_term_matrix(docs, tfidf_vectorizer))
The output below shows the difference in how terms are represented. The first matrix uses word counts. The second uses tf.idf weights. The latter would be more informative for the classifier, since this way terms are given weights based on their relevancy and importance in the dataset.
Matrix generated using word counts as features.. for good is love python text with working 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 Matrix generated using tfidf weights as features.. for good is ... text with working 0 0.000000 0.000000 0.000000 ... 0.000000 0.448321 0.448321 1 0.425677 0.425677 0.425677 ... 0.425677 0.302873 0.302873 [2 rows x 8 columns]
This was a just a brief post to demonstrate how the features vectors appear. Now you can try both techniques and see which contributes more to accuracy 🙂
4