Text Mining in Python – TF-IDF
Introduction
“Changes are shifting outside the words.”
Annie Lennox
In the Visualizing Text Data lesson we learned how to use term frequency to identify the most frequent words in a document. However, this method only considers the frequency of each term within a single document and doesn’t account for the frequency of the same word in other documents in the corpus. To account for this limitation, TF-IDF was introduced. TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In this lesson, we’ll explore how to calculate TF-IDF scores and visualize the results using Python.
Data source
The data used in this lesson is available on the Oxford Text Archive website. To learn more about textual data sources, please check out this post: ‘Where to find and how to load historical data’
Coding the past: identifying relevant words in historical documents
1. TF-IDF formula
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical measure that indicates the importance of a word in a document taking into account how frequent the word is in other documents in the same corpus. It consists of multiplying the term frequency (TF) by the inverse document frequency (IDF), which is the logarithm of the total number of documents divided by the number of documents containing the term. The formula is as follows:
\[w_{ij} = tf_{ij} * log(\frac{N}{df_i})\]
Where:
- \(w_{ij}\) is the tf-idf weight for word \(i\) in document \(j\);
- \(tf_{ij}\) is the number of times word \(i\) appears in document \(j\) divided by the total number of words in document \(j\);
- \(N\) is the total number of documents in the corpus;
- \(df_i\) is the number of documents in the corpus that contain word \(i\).
2. TF-IDF calculation example
Suppose you want to calculate the TF-IDF weight for the word “British”, which appears 5 times in a document containing 100 words. Given a corpus containing 4 documents, with 2 documents mentioning the word “British”, TF-IDF can be calculated, using a base 10 logarithm, as:
\[w_{British} = \frac{5}{100} * log_{10}(\frac{4}{2}) \approx 0.015\]
TF-IDF increases as the term frequency increases, but it decreases as the number of documents in the corpus that contain the word increases. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.
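The worked example above can be checked with a few lines of Python. This is a minimal sketch, not the lesson's own code: the function name tf_idf_weight and its parameter names are hypothetical, and a base 10 logarithm is assumed, as in the rest of the lesson.

```python
import math


def tf_idf_weight(term_count, doc_length, n_docs, doc_freq):
    """TF-IDF weight of one word in one document (base 10 logarithm assumed)."""
    tf = term_count / doc_length          # term frequency
    idf = math.log10(n_docs / doc_freq)   # inverse document frequency
    return tf * idf


# "British": 5 occurrences in a 100-word document, found in 2 of 4 documents.
w_british = tf_idf_weight(term_count=5, doc_length=100, n_docs=4, doc_freq=2)
print(round(w_british, 3))  # 0.015

# The weight shrinks as more documents contain the word and reaches 0
# when every document in the corpus contains it.
for doc_freq in range(1, 5):
    print(doc_freq, tf_idf_weight(5, 100, 4, doc_freq))
```

The loop makes the second half of the statement concrete: with the term frequency held fixed, the weight falls as the document frequency grows.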
3. Preprocessing text data in Python
We will be using the same functions from the lesson Visualizing Text Data to preprocess the text data. The functions are:
- load_text: loads the text from a txt file and returns a list of words;
- prepare_text: preprocesses the text by removing stopwords, removing words with fewer than 3 characters, and transforming all words to lower case;
- count_freq: counts the frequency of each word in the document and returns a dataframe with the results.
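For readers who do not have the previous lesson at hand, the three functions could look roughly like the sketch below. The function names come from the lesson, but the bodies are an assumption based only on the descriptions above. The short stopword set and the column names word and count are hypothetical, and the original lesson may well use a fuller stopword list.

```python
import re
from collections import Counter

import pandas as pd

# Assumption: a tiny illustrative stopword set, not the list used in the lesson.
STOPWORDS = {"the", "and", "that", "with", "for", "this", "from", "have", "are", "not"}


def load_text(path):
    """Read a plain-text file and return its words as a list."""
    with open(path, encoding="utf-8") as f:
        return re.findall(r"[A-Za-z]+", f.read())


def prepare_text(words, stopwords=STOPWORDS):
    """Lower-case the words, then drop stopwords and words shorter than 3 characters."""
    lowered = (w.lower() for w in words)
    return [w for w in lowered if len(w) >= 3 and w not in stopwords]


def count_freq(words):
    """Return a dataframe with one row per word and its count, most frequent first."""
    counts = Counter(words)
    return pd.DataFrame(counts.most_common(), columns=["word", "count"])
```

Lower-casing happens before the stopword check in this sketch so that capitalised stopwords such as "The" are also removed.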
After loading the functions above, we can use them to preprocess the text data.
Now we load the manifests of three authors: Oxenbridge Thacher, James Otis, and James Mayhew. The results are stored in three lists called thacher, otis, and mayhew. After that, we preprocess the text data using the function prepare_text and count the frequency of each word in each document using the function count_freq. Results are stored in three dataframes called thacher_df, otis_df, and mayhew_df.
4. Calculating term frequencies for each document
As we have seen, the first component to calculate the TF-IDF weight is the term frequency. We can calculate the term frequency for each document by dividing the number of times a word appears in the document by the total number of words in it.
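A minimal sketch of this step, assuming a dataframe with the hypothetical columns word and count. The toy values below stand in for a real dataframe such as thacher_df, and the new column name tf is also an assumption.

```python
import pandas as pd

# Hypothetical toy counts standing in for a dataframe such as thacher_df.
toy_df = pd.DataFrame({"word": ["british", "colonies", "trade"], "count": [5, 3, 2]})

# Term frequency: occurrences of each word divided by the total number of words.
toy_df["tf"] = toy_df["count"] / toy_df["count"].sum()
print(toy_df)
```

Because the counts are taken after preprocessing, the total here is the number of words that survive stopword removal, so the term frequencies of a document sum to 1.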
5. Calculating how many documents in the corpus contain each word
To calculate \(df_i\), we will left join all our dataframes. Then, for each row (word), we will count the documents in which its term frequency is greater than zero. In other words, we will count in how many documents the word appears at least once. This value ranges from 1 to 3, since we have three documents in our corpus. The count can be made with the pandas method ne(), which checks whether the values in the specified columns are not equal (ne) to zero. The results are booleans that can be summed to get the number of documents in which the word appears.
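The counting step might be sketched as follows, using toy term frequencies. The column names tf_thacher, tf_otis, and tf_mayhew are hypothetical. Note one deliberate difference from the text: this sketch uses outer joins rather than a left join, so that words missing from the first dataframe are kept. That choice is an assumption, not necessarily what the lesson's code does.

```python
import pandas as pd

# Hypothetical toy term frequencies for the three documents.
thacher_tf = pd.DataFrame({"word": ["british", "colonies", "trade"], "tf_thacher": [0.5, 0.3, 0.2]})
otis_tf = pd.DataFrame({"word": ["british", "colonies", "rights"], "tf_otis": [0.4, 0.4, 0.2]})
mayhew_tf = pd.DataFrame({"word": ["british", "liberty"], "tf_mayhew": [0.6, 0.4]})

# Outer joins keep every word from every document; a word absent from a
# document gets NaN, which is replaced by a term frequency of 0.
corpus = (
    thacher_tf.merge(otis_tf, on="word", how="outer")
    .merge(mayhew_tf, on="word", how="outer")
    .fillna(0)
)

tf_cols = ["tf_thacher", "tf_otis", "tf_mayhew"]
# ne(0) is True where a word occurs in a document; summing the booleans
# across columns gives the number of documents containing the word.
corpus["dfi"] = corpus[tf_cols].ne(0).sum(axis=1)
print(corpus)
```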
6. TF-IDF calculation
Finally, we have all the elements to calculate the TF-IDF weight. Note that we will be using base 10 logarithms. To calculate the logarithm, we can use the math module and its function log10(). We use the apply() method to compute the logarithm of \(N / df_i\) for each row of dfi and then multiply it by the term frequency. The results are stored in three new columns called TF-IDF_thacher, TF-IDF_otis, and TF-IDF_mayhew. Note that dfi varies from 1 to 3, so when a word appears in all three documents, the logarithm term is zero and consequently TF-IDF is zero as well (\(log_{10} \frac{3}{3} = log_{10} 1 = 0\)).
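A minimal sketch of this calculation on a toy table. The output column names follow the text, while the tf_ column names and all the values are hypothetical.

```python
import math

import pandas as pd

N_DOCS = 3  # number of documents in the corpus

# Hypothetical merged table: term frequency per document plus document frequency (dfi).
corpus = pd.DataFrame({
    "word": ["british", "colonies", "trade"],
    "tf_thacher": [0.5, 0.3, 0.2],
    "tf_otis": [0.4, 0.6, 0.0],
    "tf_mayhew": [1.0, 0.0, 0.0],
    "dfi": [3, 2, 1],
})

# Inverse document frequency, log10(N / dfi), computed once per word.
idf = corpus["dfi"].apply(lambda d: math.log10(N_DOCS / d))

# TF-IDF is the term frequency of each document multiplied by the shared idf.
for author in ["thacher", "otis", "mayhew"]:
    corpus[f"TF-IDF_{author}"] = corpus[f"tf_{author}"] * idf

print(corpus)
```

In this toy table "british" appears in all three documents, so its TF-IDF is zero in every column regardless of how high its term frequency is.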
7. Comparing TF top 10 words with TF-IDF top 10 words
Now we can compare how the two methods define the 10 most important words in each document. Keep in mind that the term frequency does not account for the words in other documents of the corpus while TF-IDF does. TF logic is that the most important words are the ones that appear the most in the document. TF-IDF logic is that the most important words are the ones that appear the most in the document but not in the other documents of the corpus. TF-IDF is more sophisticated because it helps you to distinguish one document from the others.
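One way to extract the two rankings is the pandas method nlargest(). The sketch below uses a made-up table with hypothetical column names and scores, and keeps only the top 2 words so that the difference is visible on four rows. In the lesson the cutoff would be 10.

```python
import pandas as pd

TOP_N = 2  # the lesson compares the top 10; 2 is enough for this toy table

# Hypothetical scores for a single document; the values are illustrative only.
scores = pd.DataFrame({
    "word": ["british", "government", "colonies", "charter"],
    "tf_otis": [0.40, 0.30, 0.20, 0.10],
    "TF-IDF_otis": [0.0, 0.0, 0.035, 0.048],
})

# Rank the same words by each measure and keep the highest-scoring ones.
top_tf = scores.nlargest(TOP_N, "tf_otis")["word"].tolist()
top_tfidf = scores.nlargest(TOP_N, "TF-IDF_otis")["word"].tolist()

print("Top by TF:    ", top_tf)
print("Top by TF-IDF:", top_tfidf)
```

With these toy numbers, the words shared by every document lead the TF ranking but drop out of the TF-IDF ranking, which is the pattern the comparison is meant to reveal.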
Please note that common words in the corpus, such as “Britain” and “government,” simply do not appear in the top 10 chart because they are present in all three documents. The intuition behind this is that these words are so common in the corpus that they do not provide much useful information. The top 10 words found with TF-IDF have a stronger explanatory power in distinguishing between the three authors. Furthermore, despite being present in two documents, the word “colonies” continues to have a strong TF-IDF score. This is because the word is not present in all three documents, and it is very common in the two documents where it appears.
Conclusions
- TF-IDF depends on two factors: the frequency of a word in a document, and the inverse frequency of the word in the corpus;
- It is possible to calculate TF-IDF scores from scratch in Python, which helps you to understand the logic behind the calculation;
- TF-IDF is a useful tool when you want to identify words that are specific to a particular document.