Fuzzy Joins Tutorial

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

We have provided examples of how you can apply fuzzy joins in R and we assume that you are familiar with string distances and similarities. In this tutorial, we will show how you can apply fuzzy join in Python. Since we work mainly with the Levenshtein distance, it will be helpful to provide here the formula:

Fuzzy Joins Tutorial 1

where “a” and “b” are strings and the min refers to the deletion, insertion, and substitution. Notice that is very helpful to provide the Levenshtein Similarity Ratio which can be calculated from the formula below:

Fuzzy Joins Tutorial 2

where |a| and |b| refer to the length of the strings a and b respectively.

Example of Fuzzy Joins

We will provide a practical example of Fuzzy Joins. We will work with the FuzzyWuzzy and textdistance libraries. Assume that you are dealing with two different data frames and you want to match them on the string. Below we provide the two hypothetical data frames:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import textdistance

df1 = pd.DataFrame({'Text_A':['12 Estias Street, Ampelokipi',
                             'Georgios Pipis',
                             'fuzzy much in python',
                             'Today is Friday! TGIF',
                             'This is a partial text']})


df2 = pd.DataFrame({'Text_B':['Predictive Hacks is a Data Science Blog',
                             'Abelokipi, Estias Str 12',
                             'Fuzzy matching in Python',
                             'George P. Pipis',
                             'partial text']})
 
Fuzzy Joins Tutorial 3

Cross Join

The first thing that we can do is to cross join the two data frames as follows:

# cross join the data frames

df1['dummy'] = True
df2['dummy'] = True

df = pd.merge(df1,df2, on='dummy')

df.drop('dummy', axis=1, inplace=True)
df
 
Fuzzy Joins Tutorial 4

FuzzyWuzzy Library

We will caclulate the follwing ratios between the two columns of our data frame:

  • Ratio: It refers to the Levenshtein Distance Ratio.
  • Partial Ratio: Assume that we are dealing with two strings of different lengths such as L1 and L2, and assume that L1 is less than L2. Then the algorithm seeks the score of the best matching of length -L1 substring.
  • Token Sort Ratio: First it removes punctuations and converts the text to lower case and then it tokenizes it. Then it sorts the tokens alphabetically and then it joins them in a single string.
  • Token Set Ratio: Similar to the Token Sort Ratio, but it takes into consideration the unique tokens.

Below we will return these four measures by adding them as columns to the joined data frame.

df['Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.ratio(x.Text_A, x.Text_B), axis=1)
df['Partial_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.partial_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Sort_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.token_sort_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Set_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.token_set_ratio(x.Text_A, x.Text_B), axis=1)

df
 
Fuzzy Joins Tutorial 5

Fuzzy Joins

Since we have calculated the pairwise similarities of the text, we can join the two string columns by keeping the most similar pair. Let’s assume that we want to match df1 on df2. We can group the joined df on Text_A and get the rank of similarities and then keep the most similar (i.e. Rank=1). We have to choose measure, and for this example, we will keep the Token_Set_Ratio

df['Rank_Token_Set_Ratio'] = df.groupby('Text_A')['Token_Set_Ratio'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[df.Rank_Token_Set_Ratio==1]
Fuzzy Joins Tutorial 6

As we can see, it captured all the matches but it matched also two strings that are not related. The reason for that is because the text “Today is Friday! TGIF” cannot be matched with any text from Text_B. For that reason, it makes sense to add a threshold. A good threshold is around 70.

df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[(df.Rank_Token_Set_Ratio==1)&(df.Token_Set_Ratio>70)]
 
Fuzzy Joins Tutorial 7

Process Extract

We can find the most similar string out of a vector of strings using the process module:

df1['Most_Similar'] = df1.Text_A.apply(lambda x : [process.extract(x, df2.Text_B, limit=1)][0][0][0])

df1[['Text_A', 'Most_Similar']]
 
Fuzzy Joins Tutorial 8

TextDistance Library

Previously we showed how to get similarity measures based on Levenshtein similarity obtained from the FuzzyWuzzy library. Let’s see how we can get other similarities based on q-grams, like Jaccard and Cosine Similarity using the textdistance package. Let’s return the Cosine and Jaccard similarity of the joined df based on 2-gram based on characters.

# cross join the data frames

df1['dummy'] = True
df2['dummy'] = True

df = pd.merge(df1,df2, on='dummy')

df.drop('dummy', axis=1, inplace=True)

df['Jaccard'] = df.apply(lambda x: (1-textdistance.Jaccard(qval=2).distance(x['Text_A'],  x['Text_B'])), axis=1)
df['Cosine'] = df.apply(lambda x: (1-textdistance.Cosine(qval=2).distance(x['Text_A'],  x['Text_B'])), axis=1)
df
 
Fuzzy Joins Tutorial 9
# get the most similar with a threshold of 0.3
df['Rank_Jaccard'] = df.groupby('Text_A')['Jaccard'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Jaccard']].loc[(df.Rank_Jaccard==1)&(df.Jaccard>0.3)]
 
Fuzzy Joins Tutorial 10

PollyFuzz Library

There exists a new library called PollyFuzz that I didn’t have time to work thoroughly on it. I will provide some examples from the documentation.

Currently, the following models are implemented in PolyFuzz:

  • TF-IDF
  • EditDistance (you can use any distance measure, see documentation)
  • FastText and GloVe
  • Transformers
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]


model = PolyFuzz("TF-IDF").match(from_list, to_list)
model.get_matches()
Fuzzy Joins Tutorial 11
model.group(link_min_similarity=0.75)
model.get_matches()
Fuzzy Joins Tutorial 12

You can find more examples here

To leave a comment for the author, please follow the link and comment on their blog: Python – Predictive Hacks.

Want to share your content on python-bloggers? click here.