How to Create a Powerful TF-IDF Keyword Research Tool

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)

We are in the age of digital marketing, and words matter more than ever. One of the most effective techniques in digital marketing is competitor analysis and keyword research: in other words, finding out what our competitors talk about. This is mostly useful for Search Engine Optimization, but also for generating blog post ideas and more.

Step 1: Get the text from a website

In this step, we will create a function that extracts the clean text from a URL so we can use it later in our analysis.

import re
import urllib
from urllib.request import Request, urlopen

import pandas as pd
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords
# run nltk.download('stopwords') once if the corpus is not installed yet
stopWords = list(set(stopwords.words('english')))



def get_text(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req, timeout=5).read()
        soup = BeautifulSoup(webpage, "html.parser")
        texts = soup.find_all(string=True)
        # keep only the visible text: drop scripts, styles and document metadata
        res = u" ".join(t.strip() for t in texts
                        if t.parent.name not in ['style', 'script', 'head', 'title', 'meta', '[document]'])
        return res
    except Exception:
        return False

Let’s look at an example.

get_text('https://en.wikipedia.org/wiki/Machine_learning')[0:500]
# the first 500 characters of the extracted text
'CentralNotice    Machine learning   From Wikipedia, the free encyclopedia     Jump to navigation  Jump to search  For the journal, see Machine Learning (journal) .  "Statistical learning" redirects here. For statistical learning in linguistics, see statistical learning in language acquisition .  Scientific study of algorithms and statistical models that computer systems use to perform tasks without explicit instructions  Part of a series on Machine learning and data mining  Problems  Class'
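To see what the parent-name filter inside `get_text` actually does, here is a quick offline sketch of the same logic on a toy HTML snippet (made up for illustration): text inside `script`, `style`, and `title` tags is dropped, while visible body text survives.

```python
from bs4 import BeautifulSoup

# toy HTML with both visible and non-visible text
html = """
<html>
  <head><title>Ignored title</title><style>body {color: red;}</style></head>
  <body><p>Visible paragraph.</p><script>var hidden = 1;</script></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
texts = soup.find_all(string=True)
# same filtering rule as in get_text: skip non-visible parents
clean = " ".join(t.strip() for t in texts
                 if t.parent.name not in ['style', 'script', 'head', 'title', 'meta', '[document]'])
print(clean.strip())  # -> Visible paragraph.
```

The title, the CSS, and the JavaScript are all gone; only the paragraph text remains.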

Success! Now we can get the clean text from a website. But how exactly can we use it? Let’s jump to the next step.

Step 2: Get the URLs from competitors

The best way to find our competitors is to take the top Google search results for a keyword of interest. We will use the code from a previous post, How To Scrape Google Results For Free Using Python.

def google_results(keyword, n_results):
    query = urllib.parse.quote_plus(keyword)  # format the keyword into URL encoding
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(n_results)
    response = requests.get(google_url, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs={'class': 'ZINbbc'})
    # extract the target URL from each result's redirect link
    anchors = [i.find('a', href=True) for i in result_div]
    results = [re.search(r'/url\?q=(.*)&sa', a['href']) for a in anchors if a]
    links = [r.group(1) for r in results if r is not None]
    return links

Let’s say that we want to see our “competitors” for the keyword “machine learning blog”. We get the top URLs using the google_results function, where the first argument is the keyword and the second the number of results.

google_results('machine learning blog',10)
['https://towardsai.net/p/machine-learning/best-machine-learning-blogs-6730ea2df3bd',
 'https://machinelearningmastery.com/blog/',
 'https://towardsdatascience.com/how-to-start-a-machine-learning-blog-in-a-month-7eaf84692df9',
 'http://ai.googleblog.com/',
 'https://www.springboard.com/blog/machine-learning-blog/',
 'https://blog.ml.cmu.edu/',
 'https://blog.feedspot.com/machine_learning_blogs/',
 'https://aws.amazon.com/blogs/machine-learning/',
 'https://neptune.ai/blog/the-best-regularly-updated-machine-learning-blogs-or-resources',
 'https://www.stxnext.com/blog/best-machine-learning-blogs-resources/']

Step 3: Analyse the text and get the most important words

Let’s think: which words are the most important? For our analysis, we will use three metrics: the average TF-IDF, the max TF-IDF, and the frequency. The pipeline is the following. We get the text of every website (in our case, the top 12 results) and use those texts as the corpus for a TF-IDF vectorizer. From the resulting matrix, we take the average and the max TF-IDF score of every word. The frequency is also easy to get from the TF-IDF matrix: we say that a word is contained in a URL if its score in that row is non-zero, and we compute the percentage of URLs that contain it. The complete function is the following.

def tf_idf_analysis(keyword):
    links = google_results(keyword, 12)

    # keep only the pages we could actually download
    text = []
    for i in links:
        t = get_text(i)
        if t:
            text.append(t)

    v = TfidfVectorizer(min_df=2, analyzer='word', ngram_range=(1, 5), stop_words=stopWords)
    x = v.fit_transform(text)

    # use get_feature_names_out() instead on scikit-learn >= 1.0
    f = pd.DataFrame(x.toarray(), columns=v.get_feature_names())

    d = pd.concat([pd.DataFrame(f.mean(axis=0)), pd.DataFrame(f.max(axis=0))], axis=1)

    # number of documents in which each word has a non-zero TF-IDF score
    tf = pd.DataFrame((f > 0).sum(axis=0))

    d = d.reset_index().merge(tf.reset_index(), on='index', how='left')
    d.columns = ['word', 'average_tfidf', 'max_tfidf', 'frequency']

    # comment out the next line if you want the raw number of URLs in which the
    # word occurs; the percentage makes more sense when we check many URLs
    d['frequency'] = round((d['frequency'] / len(text)) * 100)

    return d

Now that our final function is ready, let’s have a look at our competitors in machine learning, using the “machine learning blog” keyword.

x= tf_idf_analysis('machine learning blog')

# keep alphabetic words only, sort by max TF-IDF, and show the top 20
x[x['word'].str.isalpha()].sort_values('max_tfidf',ascending=False).head(20)
           word  average_tfidf  max_tfidf  frequency
929      google       0.098790   0.626160       67.0
254         aws       0.052512   0.550785       25.0
171      amazon       0.060131   0.537993       33.0
1472      model       0.058276   0.521179       33.0
307        blog       0.131429   0.385008      100.0
133          ai       0.109516   0.358522       83.0
1222   learning       0.191090   0.352528      100.0
717         end       0.036682   0.304649       58.0
1332    machine       0.158022   0.295191      100.0
525     cookies       0.023013   0.263509       17.0
1980        see       0.030134   0.255031       58.0
439         cmu       0.028235   0.253162       17.0
2242    towards       0.035054   0.245614       42.0
862   followers       0.022837   0.245576       17.0
1949    science       0.057179   0.240060       58.0
670      domain       0.021410   0.236214       25.0
739       entry       0.022586   0.233097       17.0
944    gradient       0.019838   0.233097       17.0
377    brownlee       0.021105   0.226498       25.0
537     courses       0.024935   0.218224       25.0

As we can see, one of the top words is Amazon AWS. Hmm, maybe we should write something about that too 😉. This can be a powerful tool for digital marketing, and many paid services do exactly this, so start experimenting with it!
