Market Basket Analysis and Association Rules from Scratch

This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.

We have already provided a tutorial on Market Basket Analysis in Python using the mlxtend library. Today, we will show how you can compute the association rule metrics from scratch. Let’s recall the three most common ones: support, confidence and lift.

Association Rules

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. For example, we can extract information on purchasing behavior such as “If someone buys beer and sausage, then they are likely to buy mustard as well”.

Let’s define the main metrics used to evaluate association rules:

Support

It measures how often a product (or a set of products) is purchased, as a fraction of all transactions, and is given by the formulas:

\(Support(X) = \frac{Frequency(X)}{N (\#of \;Transactions)}\)

\(Support(X \rightarrow Y) = \frac{Frequency(X \bigcap Y)}{N (\#of \;Transactions)}\)
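
As a small illustration of the two formulas above, here is a minimal sketch that computes support on a toy list of transactions. The transactions and the helper name `support` are made up for this example and are not part of the groceries data used later.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "coffee"},
    {"milk"},
    {"orange", "coffee"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Support(X): milk appears in 3 of the 5 transactions
support_milk = support({"milk"}, transactions)
# Support(X -> Y): milk and bread appear together in 2 of the 5 transactions
support_milk_bread = support({"milk", "bread"}, transactions)

print(support_milk, support_milk_bread)
```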

Confidence

It measures how often items in Y appear in transactions that contain X, and is given by the formula:

\(Confidence(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X) }\)

Lift

It tells us how much more likely item Y is to be bought when item X is bought than it is in general, i.e. how much better the rule is at predicting the result than simply assuming the result in the first place. When lift > 1, the items are bought together more often than we would expect if they were independent, so the rule does better than guessing. When lift = 1, X and Y are independent, and when lift < 1, the rule does worse than guessing. It is given by the formula:

\(Lift(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X)\times Support(Y) }\)
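
To make the confidence and lift formulas concrete, the sketch below derives both from support on a toy list of transactions. The data and the helper names (`support`, `confidence`, `lift`) are assumptions made for this example only.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "coffee"},
    {"milk"},
    {"orange", "coffee"},
    {"bread"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / N

def confidence(x, y):
    """Confidence(X -> Y) = Support(X -> Y) / Support(X)."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Lift(X -> Y) = Support(X -> Y) / (Support(X) * Support(Y))."""
    return support(set(x) | set(y)) / (support(x) * support(y))

conf_milk_bread = confidence({"milk"}, {"bread"})   # 0.4 / 0.6
lift_milk_bread = lift({"milk"}, {"bread"})         # 0.4 / (0.6 * 0.6)
lift_orange_coffee = lift({"orange"}, {"coffee"})   # 0.2 / (0.2 * 0.4)

print(round(conf_milk_bread, 3), round(lift_milk_bread, 3), round(lift_orange_coffee, 3))
```

In this toy data both lifts are above 1, so each pair is bought together more often than independence would suggest.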


Coding Part

By 2 Products

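Assume that we are dealing with a groceries.xlsx file that has one row per order, with two columns: an order id and an items column holding the products of the order as a comma-separated string.

We want to transform the data into a long format with one row per (order id, product id) pair.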

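import pandas as pd

df = pd.read_excel("groceries.xlsx")
# split the comma-separated string of each order into a list of products
df['items'] = df['items'].apply(lambda x: x.split(","))

# one row per (order, product) pair
df = df.explode('items')
df.columns = ['oid', 'pid']
df.reset_index(drop=True, inplace=True)

df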
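
If you do not have the Excel file at hand, the same transformation can be tried on a small in-memory frame. This is only a sketch: the column name `order_id` and the rows are made up, and the extra `str.strip()` step is an assumption that item names may carry stray spaces after the commas.

```python
import pandas as pd

# Hypothetical stand-in for groceries.xlsx: one row per order
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "items": ["milk,bread", "milk, bread, coffee", "orange"],
})

# Split the comma-separated string into a list, then give each item its own row
long_df = raw.assign(items=raw["items"].str.split(",")).explode("items")
long_df.columns = ["oid", "pid"]
long_df = long_df.reset_index(drop=True)

# Remove whitespace left over from splitting on "," only
long_df["pid"] = long_df["pid"].str.strip()

print(long_df)
```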

Let’s write a function that returns the three metrics (support, confidence and lift) for a given pair of products. The my_pid is the antecedent and y is the consequent.

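def all_x_y(df, my_pid, y):
    # total number of distinct orders (transactions)
    N = df.oid.nunique()

    # keep only the rows of the two products; drop repeated (order, product)
    # rows so that an order holding the same product twice is not miscounted
    tmp = df.loc[df.pid.isin([my_pid, y]), ['oid', 'pid']].drop_duplicates()

    # Support(X -> Y): share of orders that contain both products
    numerator = (tmp.groupby('oid').size() == 2).sum() / N
    # Support(X) and Support(Y)
    a = df.loc[df.pid == my_pid].oid.nunique() / N
    b = df.loc[df.pid == y].oid.nunique() / N
    denominator = a * b

    lift = numerator / denominator
    confidence = numerator / a
    support = numerator

    return (support, confidence, lift)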

Let’s see some examples by calling the function for the pairs (milk, bread) and (orange, coffee), for example all_x_y(df, 'milk', 'bread'). Each call returns a (support, confidence, lift) tuple.
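
Since the output depends on the actual groceries file, here is a self-contained sketch of such calls on a toy dataset. The five orders below are invented for illustration, and the function is repeated in compact form only so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical long-format data: one row per (order id, product id)
df = pd.DataFrame({
    "oid": [1, 1, 2, 2, 2, 3, 4, 4, 5],
    "pid": ["milk", "bread", "milk", "bread", "coffee",
            "milk", "orange", "coffee", "bread"],
})

def all_x_y(df, my_pid, y):
    """Return (support, confidence, lift) for the rule my_pid -> y."""
    N = df.oid.nunique()
    tmp = df.loc[df.pid.isin([my_pid, y]), ["oid", "pid"]].drop_duplicates()
    support = (tmp.groupby("oid").size() == 2).sum() / N   # orders with both
    a = df.loc[df.pid == my_pid].oid.nunique() / N         # Support(X)
    b = df.loc[df.pid == y].oid.nunique() / N              # Support(Y)
    return (support, support / a, support / (a * b))

milk_bread = all_x_y(df, "milk", "bread")
orange_coffee = all_x_y(df, "orange", "coffee")

print("milk -> bread:", milk_bread)
print("orange -> coffee:", orange_coffee)
```

In this toy data, milk and bread share 2 of the 5 orders, so the support is 0.4, while orange is always bought with coffee, so the confidence of orange -> coffee is 1.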

You can confirm that we get the same results as those from the mlxtend module:

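from mlxtend.frequent_patterns import association_rules, apriori
# onehot was not defined in the original snippet; assumed here to be the
# order x product boolean table built from df
onehot = pd.crosstab(df['oid'], df['pid']) > 0
# compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support = 0.01, max_len = 2, use_colnames=True)
# compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets, min_threshold=0.01)
rules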
 

Now, let’s see how we can get all the possible pairs.

unique_products = df.pid.unique()
output = []

for i in unique_products:
    for j in unique_products:
        if (i!=j):
            tmp = all_x_y(df, i, j)
            output.append({
                'antecedents':i,
                'consequents':j,
                'support':tmp[0],
                'confidence':tmp[1],
                'lift':tmp[2]
                          })

output = pd.DataFrame(output)
output

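
The double loop calls the function once per ordered pair, which can get slow with many products. As an alternative (a sketch, not the approach used in this post), the same three metrics can be computed for all pairs at once from a co-occurrence matrix. The toy data and the variable names below are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (order id, product id)
df = pd.DataFrame({
    "oid": [1, 1, 2, 2, 2, 3, 4, 4, 5],
    "pid": ["milk", "bread", "milk", "bread", "coffee",
            "milk", "orange", "coffee", "bread"],
})

# Order x product indicator matrix (1 if the order contains the product)
onehot = (pd.crosstab(df["oid"], df["pid"]) > 0).astype(int)
items = onehot.columns.to_numpy()
X = onehot.to_numpy()
N = X.shape[0]

# co[i, j] = number of orders containing both product i and product j
co = X.T @ X
item_support = np.diag(co) / N                     # Support(X) per product

pair_support = co / N                              # Support(X -> Y)
confidence = pair_support / item_support[:, None]  # divide by Support(X), row-wise
lift = confidence / item_support[None, :]          # divide by Support(Y), column-wise

# Collect every ordered pair (i != j) into a long table
i, j = np.where(~np.eye(len(items), dtype=bool))
rules = pd.DataFrame({
    "antecedents": items[i],
    "consequents": items[j],
    "support": pair_support[i, j],
    "confidence": confidence[i, j],
    "lift": lift[i, j],
})
print(rules)
```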

By 3 Products

Market Basket Analysis and association rules become more complicated when we examine larger combinations. Let’s say that we want to get all the association rules with two antecedents and one consequent, i.e. given that two items are already in the basket, what are the metrics for an extra item? The first thing we need to do is generate all the possible combinations of 3 products (or the combinations of 2, and then add the right-hand side). For example:

import itertools

x = list(itertools.combinations(unique_products, 3))
x
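
As a rough illustration of where this leads (one possible sketch, not necessarily the method of the follow-up tutorial), each combination of 3 products can be turned into three rules by picking one product as the consequent and the other two as antecedents. The toy data and the helper `support` are assumptions made for this example.

```python
import itertools
import pandas as pd

# Hypothetical long-format data: one row per (order id, product id)
df = pd.DataFrame({
    "oid": [1, 1, 2, 2, 2, 3, 4, 4, 5],
    "pid": ["milk", "bread", "milk", "bread", "coffee",
            "milk", "orange", "coffee", "bread"],
})

# One set of products per order
baskets = df.groupby("oid")["pid"].agg(set)
N = len(baskets)

def support(items):
    """Fraction of orders containing every product in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / N

rows = []
for triple in itertools.combinations(sorted(df["pid"].unique()), 3):
    for y in triple:                           # y is the consequent
        x = tuple(p for p in triple if p != y)  # the two antecedents
        s_x, s_y, s_xy = support(x), support([y]), support(triple)
        if s_x == 0:
            continue                           # antecedents never bought together
        rows.append({
            "antecedents": x,
            "consequents": y,
            "support": s_xy,
            "confidence": s_xy / s_x,
            "lift": s_xy / (s_x * s_y),
        })

rules3 = pd.DataFrame(rows)
print(rules3)
```

Rules whose antecedent pair never occurs are skipped, since confidence and lift are undefined when Support(X) is zero.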

In another tutorial, we will show you how you can generate the association rules for more than two items. Stay tuned!
