Wordcount vs TF.IDF for Text Processing
* 2 min read

In this post, we will briefly highlight and demonstrate the difference between using wordcounts and tf.idf values to create features vectors from text. As in any text classification task, before we can process text, we need to convert it to a numerical representation that the computer can understand. The simplest form is simply taking word counts, and creating features vectors as a representation of the documents in the dataset. Each row is a features vector that represents the document, and the column values are a numerical value that shows how many times the term appears in the document. The column spans the entire vocabulary in the dataset.

 

The code below shows a brief example of how the features vectors of a small dataset are generated using both word counts (using the CountVectorizer() function), and then tf.idf weights (using the TfidfVectorizer() function).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from pandas import DataFrame

def get_term_matrix(docs, vectorizer):
    term_matrix = vectorizer.fit_transform(docs)
    return DataFrame(term_matrix.toarray(), 
                     columns = vectorizer.get_feature_names())

# create a small dataset of docs    
docs = ["I love working with Python", "Python is a good for working with text"]

# create a term count vectorizer (BOW)
term_count_vectorizer = CountVectorizer()

# create a tfidf weights vectorizer
tfidf_vectorizer = TfidfVectorizer()

print("Matrix generated using word counts as features..")
print(get_term_matrix(docs, term_count_vectorizer))

print("\nMatrix generated using tfidf weights as features..")
print(get_term_matrix(docs, tfidf_vectorizer))

 

The output below shows the difference in how terms are represented. The first matrix uses word counts. The second uses tf.idf weights. The latter would be more informative for the classifier, since this way terms are given weights based on their relevancy and importance in the dataset.

Matrix generated using word counts as features..
   for  good  is  love  python  text  with  working
0    0     0   0     1       1     0     1        1
1    1     1   1     0       1     1     1        1

Matrix generated using tfidf weights as features..
        for      good        is    ...         text      with   working
0  0.000000  0.000000  0.000000    ...     0.000000  0.448321  0.448321
1  0.425677  0.425677  0.425677    ...     0.425677  0.302873  0.302873

[2 rows x 8 columns]

 

This was a just a brief post to demonstrate how the features vectors appear. Now you can try both techniques and see which contributes more to accuracy 🙂

 

1

Mohammad D.

Mohammad D. works with sentiment anlaysis, NLP and Python. He loves to blog about these and other related topics in his free time.
LinkedIn

Leave a Reply