In this article we will explore how to perform fuzzy string matching using Python.
Table of contents:
- Levenshtein Distance
- Simple Fuzzy String Matching
- Partial Fuzzy String Matching
- Out of Order Fuzzy String Matching
When working with strings matching or text analytics, we often want to find the matching parts within some variables or text. Looking at the text ourselves, we can tell that Toronto Airport and Airport Toronto are referring to the same thing, and that Torotno is just a misspelled Toronto.
But how can we solve this programmatically and have Python recognize these cases? We use fuzzy string matching!
To continue following this tutorial we will need the following Python libraries: fuzzywuzzy and python-Levenshtein.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:
pip install fuzzywuzzy pip install python-Levenshtein
In order to understand the underlying calculations behind the string matching, let’s discuss the Levenshtein distance.
Levenshtein distance, in computer science, is a metric of measurement of similarity between two sequences (in our case it’s strings). It is often referred to as “edit distance”.
How so? Simply think that it calculates the minimum number of edits that should take place between two strings to make them the same. Now, the less the number of required edits is, the more similar two strings are to each other.
To learn more about Levenshtein distance and its computation, check out this article.
Simple Fuzzy String Matching
The simple ratio approach from the fuzzywuzzy library computes the standard Levenshtein distance similarity ratio between two strings which is the process for fuzzy string matching using Python.
Let’s say we have two words that are very similar to each other (with some misspelling): Airport and Airprot. By just looking at these, we can tell that they are probably the same except the misspelling. Now let’s try to quantify the similarity using simple ratio string matching:
from fuzzywuzzy import fuzz string1 = "Airport" string2 = "Airprot" print(fuzz.ratio(string1, string2))
And we get:
So the computed similarity between the two words is 86% which is pretty good for a misspelled word.
This approach works fine for short strings and strings or relatively similar length, but not so well for strings of different lengths. For example, what do you think will be the similarity between Airport and Toronto Airport? It’s actually lower than you think:
from fuzzywuzzy import fuzz string1 = "Airport" string2 = "Toronto Airport" print(fuzz.ratio(string1, string2))
And we get:
Well what happens here is that the difference in the lengths of strings plays a role. Luckily, fuzzywuzzy library has a solution for it: .partial_ratio() method.
Partial Fuzzy String Matching
Recall from the section above that when comparing Airport with Toronto Airport, we only got 64% similarity with simple string matching. In fact, in both cases we are referring to an airport that’s what we will see as a reader as well.
Because of significantly different lengths of strings we should do partial string matching. What we are interesting in here is the best match of a shorter string to a longer string.
How does it work logically? Consider two strings: Airport and Toronto Airport. We can tell right away that the first string is a substring of a second string, that is Airport is a substring of Toronto Airport, which is a perfect match:
from fuzzywuzzy import fuzz string1 = "Airport" string2 = "Toronto Airport" print(fuzz.partial_ratio(string1, string2))
And we get:
Out of Order Fuzzy String Matching
A common problem we may face with the strings is the order of the words. For example, how similar do you think Airport Toronto is to Toronto Airport? 100%?
Using the techniques from the above sections, we find surprisingly low results:
from fuzzywuzzy import fuzz string1 = "Airport Toronto" string2 = "Toronto Airport" print(fuzz.ratio(string1, string2)) print(fuzz.partial_ratio(string1, string2))
And we get:
That is probably much lower than you would expect? It’s only 47%-48%.
What we find that it’s not only the the similarity of substrings that matters, but also their order.
Same Length Strings
For this case, fuzzywuzzy library has a solution for it: .token_sort_ratio() method. What it does is it tokenizes the strings, then sorts the tokens alphabetically, and then does the string matching.
In our example, tokenizing Airport Toronto will keep it the same way, but tokenizing Toronto Airport will alphabetically order the substrings to get Airport Toronto. Now we are comparing Airport Toronto to Airport Toronto and you can guess we will probably get 100% similarity:
from fuzzywuzzy import fuzz string1 = "Airport Toronto" string2 = "Toronto Airport" print(fuzz.token_sort_ratio(string1,string2))
And we get:
Different Length Strings
For this case, fuzzywuzzy library has a solution for it: .token_set_ratio() method. What it does is it tokenizes the strings, then splits then into [intersection] and [remainder], then sorts the strings in each group alphabetically, and then does the string matching.
Consider two strings: Airport Toronto and Toronto Airport Closed. In this case, the [intersection] group will be Airport Toronto, the [remainder] of the first string will be empty, and the [remainder] of the second string will be Closed.
Logically we can see that the score will be higher for the string pairs that have a larger [intersection] group since there will be a perfect match, and the variability comes from comparison of the [remainder] groups:
from fuzzywuzzy import fuzz string1 = "Airport Toronto" string2 = "Toronto Airport Closed" print(fuzz.token_set_ratio(string1,string2))
And we get:
In this article we explored how to perform fuzzy string matching using Python.
I also encourage you to check out my other posts on Python Programming.
Feel free to leave comments below if you have any questions or have suggestions for some edits.