What is tf.idf?
Term frequency-inverse document frequency (tf.idf) is a popular statistical measure that estimates how relevant a particular word is to a document within a corpus of documents. The technique derives from information retrieval, where it is powerful enough that even the Google search engine has used it, and it is applied in many NLP tasks as well, such as text classification, summarization, named entity recognition and sentiment analysis.
The tf.idf of a word is the product of two quantities: tf, how many times the term appears in the target document, and idf, the log of the number of documents in the corpus divided by the number of documents containing the term. Multiplying the two gives an estimate of how relevant the term is to that document within the corpus: the higher the tf.idf score of a word, the more relevant it is considered.
Why is it useful in NLP?
This is useful because it assigns high scores to important words in a document while penalizing common words that appear many times in the document but also appear in many other documents. Such words are often stop words: words that have little impact on the final analysis even though they appear many times in the corpus. For example, terms like ‘the’, ‘this’, ‘on’, ‘have’, ‘and’, etc. appear many times in a corpus but are irrelevant in NLP tasks such as text classification, text summarization, chat bots, and search result ranking.
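To see such words concretely, NLTK ships a ready-made English stop word list (this assumes you have NLTK installed and have run nltk.download('stopwords')):

from nltk.corpus import stopwords

# NLTK's built-in English stop word list; it contains words like
# 'the', 'this', 'on', 'have' and 'and' mentioned above
print(stopwords.words('english')[:10])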
In computational NLP tasks, text must be turned into numbers that computers and machine learning classifiers can work with before it can be analyzed. Transforming text into numbers, or text vectorization, is therefore an initial step, and tf.idf scores, by scoring words based on their relevancy, tell the computer to emphasize key words during analysis while dampening the scores of less important words. Low-score words can either be discarded (which helps reduce processing time, since they appear frequently) or simply ignored during whatever kind of analysis you are doing (e.g. text classification). At the same time, high-score words can be prioritized, improving the final outcome.
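As a minimal sketch of the ‘discard low-score words’ idea (the scores and the threshold here are hypothetical, purely for illustration):

# hypothetical tf.idf scores for words in one document
scores = {'bird': 13.52, 'swimming': 8.40, 'the': 2.25, 'on': 0.95}
threshold = 5.0  # an arbitrary cutoff for this illustration
important = {word: s for word, s in scores.items() if s >= threshold}
print(important)  # {'bird': 13.52, 'swimming': 8.4}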
Breaking it down
Term frequency is how often the target word appears in the document: the raw count of the word in the document. The term frequency is usually normalized by dividing this count by the total number of words in the document.
Inverse document frequency is computed across the whole corpus. It is the log of the number of documents in the corpus divided by the number of documents that contain the word, and it captures how ‘rare’, and therefore informative, the target word is. The nearer the value is to 0, the more common the word.
term frequency * inverse document frequency is the product of both of the above, and this is the final tf.idf weight of the word. This weight is used in text vectorization, placing higher emphasis on important words and telling the classifier to prioritize them over other words.
Mathematical representation
The final tf.idf score for a word t in document d from the corpus is computed below.
tf.idf(t, d, corpus) = tf(t, d) * idf(t, corpus)

where:

tf(t, d) = frequency(t) / count of words in d
idf(t, corpus) = log(docs in corpus / docs that t appears in)
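The formulas translate directly into a few lines of Python. Here is a minimal sketch, assuming each document is already a list of lowercase tokens (the function names tf, idf and tf_idf are just illustrative, and a base-10 log is used to match the worked example below):

import math

def tf(term, document):
    # normalized term frequency: raw count / document length
    return document.count(term) / len(document)

def idf(term, corpus):
    # log of (docs in corpus / docs that the term appears in);
    # assumes the term appears in at least one document
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / docs_with_term)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

corpus = [['the', 'bird', 'flew'], ['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
print(tf_idf('bird', corpus[0], corpus))  # 0.159: rare word, higher weight
print(tf_idf('the', corpus[0], corpus))   # 0.0: appears in every document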
Simple example
Let’s say the word bird appears in a document 8 times, and appears in 20 documents out of a 1,000 document corpus. For simplicity, we do not normalize the tf by the count of terms in d in this example, and we use a base-10 log. This can be computed as:
tf(bird) = 8
idf(bird) = log(1,000/20) = 1.69
tf.idf(bird) = 8 * 1.69 = 13.52
Now let’s say the word the also appears in the same document 25 times, and appears in 800 of the documents in the 1,000 document corpus. This would be computed as:
tf(the) = 25
idf(the) = log(1,000/800) ≈ 0.09
tf.idf(the) = 25 * 0.09 = 2.25
As you can see, even though the appears 25 times in the document and bird only appears 8 times, bird is still assigned a much higher weight (13.52) than the (2.25), because the is a very common word that appears in the majority of documents in the corpus. This tells the classifier to prioritize bird, and pay less attention to the, during analysis.
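You can verify these numbers quickly in Python (at full precision the scores come out slightly higher than the truncated figures above, but the ranking is identical):

import math

idf_bird = math.log10(1000 / 20)   # 1.6990 (1.69 above)
idf_the = math.log10(1000 / 800)   # 0.0969 (0.09 above)
print(round(8 * idf_bird, 2))      # 13.59
print(round(25 * idf_the, 2))      # 2.42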
Simplified Python Implementation
The following code takes a very small corpus that consists of a few sentences. For simplicity, each of the sentences represents a ‘document’ in the corpus.
import nltk
import numpy as np
import re
import heapq

# Note: nltk.sent_tokenize and nltk.word_tokenize require the 'punkt'
# tokenizer models: nltk.download('punkt')

article = """
I like to go swimming. Swimming is my favorite sport.
I also like other sports too. But this is my favorite.
It is a popular sport. I do this every weekend.
It is good exercise and a good way to stay in shape.
"""

corpus = nltk.sent_tokenize(article)  # get each sentence from the corpus
for x in range(len(corpus)):
    corpus[x] = corpus[x].lower()
    corpus[x] = re.sub(r'\W', ' ', corpus[x])  # remove all punctuation and non-word characters

# get word frequencies
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

# get the 15 most frequent words in the corpus,
# so each sentence has a corresponding 15-element tf.idf vector
most_freq = heapq.nlargest(15, wordfreq, key=wordfreq.get)
#print(most_freq)

# get tf values
word_tf_values = {}
for token in most_freq:
    sent_tf_vector = []
    for document in corpus:
        doc_freq = 0
        for word in nltk.word_tokenize(document):
            if token == word:
                doc_freq += 1
        word_tf = doc_freq / len(nltk.word_tokenize(document))
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector

# get idf values
word_idf_values = {}
for token in most_freq:
    doc_containing_word = 0
    for document in corpus:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
    word_idf_values[token] = np.log(len(corpus) / (1 + doc_containing_word))

# get the product of both (tf * idf)
tfidf_values = []
for token in word_tf_values.keys():
    tfidf_sentences = []
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

# create an array where each column is the tf.idf vector for one sentence
tf_idf_model = np.asarray(tfidf_values)
print(tf_idf_model)
The output is a small matrix in which each column is the 15-element tf.idf vector for one sentence (one row per word). Each word found in a sentence gets a numerical value representing its tf.idf score; the higher this score, the more relevant the word is in that particular document. This code was written from scratch for learning purposes; libraries such as Scikit-learn provide functionality for easily creating tf.idf vectors from text (TfidfVectorizer).
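For comparison, here is a minimal sketch using TfidfVectorizer (note that its defaults differ from the code above: it smooths the idf, L2-normalizes each vector, and puts documents in rows rather than columns):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'i like to go swimming',
    'swimming is my favorite sport',
    'i also like other sports too',
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # shape: (num documents, vocabulary size)
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))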
Final words
In transforming text to a numerical form for classification or other text processing tasks, the tf.idf measure is extremely powerful and informative, and can drastically improve results compared to using something like raw word counts alone.