Python Musings #3: Geocoding data with Selenium and Google Maps

[This article was first published on Python Musings – bensstats, and kindly contributed to python-bloggers.]

For any organization or business, geographical data can be a powerful source of insight, whatever the nature of the organization. Understanding an organization’s reach, and having that knowledge readily available as a visualization, goes a long way toward understanding its influence and impact.

For example, a quick look at my WordPress dashboard gives me easy insight into the reach of my little website so far. It’s pretty cool to see that!

I wouldn’t have any clue how big the reach of my blog is without this! (Circa August 26, 2020)

If you are working with data at the country, state/province or zip code level, the geo-coding built into software like Tableau is great. However, if you want to get down to the street level, it gets more complicated, as street-level geo-codes are not as readily available.

In this blog post we are going to look at how to create a robust solution for getting properly formatted addresses and geo-codes from Google Maps using selenium and some regular expressions.

Let’s go!

The Problem

Suppose you have a simple database with addresses which have been manually inputted. How do you geo-code the address data to build a dashboard or plot a route for hundreds (or thousands) of addresses?

Here are some of the challenges:

  1. The addresses may not be complete (e.g. missing a postal code).
  2. The addresses may have an error in them (e.g. a mistyped postal code).
  3. The addresses aren’t properly formatted. While readable to humans, they will bring up errors when fed into traditional geo-coding software.
  4. Communicating with an API can be confusing. Not everyone can afford to pay for one either.

These challenges are what stop a lot of businesses from even looking into having this insight available- even if their data is fully cleaned!

But what if I told you it’s possible to do this by extracting information from the links Google Maps gives us? And that it can be automated with just a simple for loop?

Extracting Geo-codes from Google Maps

First things first: Setting up selenium

If you don’t have any background in selenium, I highly recommend that you first check out TheCodex’s tutorial on selenium. It only takes 10 minutes to learn! Once you have selenium set up, this code is easy to run and understand.

This code is currently only compatible with ChromeDriver. Once you know the selenium basics, though, switching to a driver of your choice is relatively simple and does not change the code drastically; it only requires a small alteration to the webdriver method used in the function.

How this function works: examining the Google Maps link

This function does five things.

  1. First we open Google Maps.
  2. We then search Google Maps for the address(es) we have in our database. Google should be able to locate an address as long as it was entered with some accuracy.
  3. We copy the properly formatted address as Google has it (this is a way to validate the process).
  4. We copy the unique link Google Maps gives us.
  5. We perform some regex operations to extract the latitude and longitude values.

After all this is done, the function returns a list containing your address, formatted as it is found on Google Maps, and your latitude and longitude values.

This is the actual code. You can see it on my Github here.
Snippet made courtesy of Carbon.
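
In text form, the five steps look roughly like this. Treat it as a minimal sketch rather than a copy of the GitHub code: the name `geocode_address`, the `pause` parameter, opening the search through a Google Maps search URL instead of typing into the search box, and reading the formatted address from the tab title are all assumptions made for illustration. The regex patterns are the positive-latitude, negative-longitude ones discussed later in this post.

```python
import re
import time
from urllib.parse import quote_plus

# Patterns from this post: positive latitude, negative longitude only.
GEOCODE = re.compile(r"@\d{1,3}\.\d{5,},-\d{1,3}\.\d{5,}")
LATITUDE = re.compile(r"(?<=@)\d{1,3}\.\d{5,}")
LONGITUDE = re.compile(r"(?<=,)-\d{1,3}\.\d{5,}")


def geocode_address(driver, address, pause=5.0):
    """Return [formatted_address, latitude, longitude] for one address.

    `driver` is a selenium WebDriver (or anything with `get`, `title`
    and `current_url`).
    """
    # Steps 1-2: open Google Maps with the address as the search query.
    driver.get("https://www.google.com/maps/search/?api=1&query="
               + quote_plus(address))
    # Give the page time to load and update its own link.
    time.sleep(pause)
    # Step 3 (assumption): the tab title reads "<address> - Google Maps".
    formatted = driver.title.replace(" - Google Maps", "").strip()
    # Step 4: copy the link Google Maps ended up on.
    link = driver.current_url
    # Step 5: pull the "@lat,lng" part out of the link, then split it.
    found = GEOCODE.search(link)
    if found is None:
        raise ValueError(f"no geo-code found in link for {address!r}")
    coords = found.group(0)
    latitude = float(LATITUDE.search(coords).group(0))
    longitude = float(LONGITUDE.search(coords).group(0))
    return [formatted, latitude, longitude]


# Usage (needs selenium installed and a matching ChromeDriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# try:
#     print(geocode_address(driver, "350 5th Ave, New York"))
# finally:
#     driver.quit()
```

Passing the driver in, rather than creating it inside the function, lets one browser window be reused for a whole list of addresses.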

The function can also be called in a for loop. For our example we’ll use a list of length 1.

We then get the following output. Note the differences between the initial input and the Google validated address. Google gives us the state and zip code as well, together with the latitude and longitude.

We are given as output the whole address, the latitude and longitude.

Some comments: This isn’t perfect

1. This function is slow.

While this function is really useful for getting your data geo-coded, it is crucial to build sleeps into it. Selenium interacts with Google Maps the way a user would, so web pages load more slowly than Python can execute its next action. Without the sleeps, the code can try to read a page that has not finished loading, which leads to errors.
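
A fixed sleep has to be long enough for the slowest page, so every address pays the full price. One possible alternative, sketched here under the assumption that "the link contains an @" is a good enough sign that the page is ready, is to poll for a condition and stop waiting as soon as it holds. The helper name `wait_for` and its parameters are made up for illustration; selenium also ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_for(condition, timeout=15.0, poll=0.5):
    """Call `condition()` every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# With a selenium driver, this could stand in for a fixed sleep after the
# search; it returns the link as soon as it contains an "@":
# link = wait_for(lambda: "@" in driver.current_url and driver.current_url)
```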

I also added some try-except blocks so that the code runs smoothly in loops and does not crash when working with large amounts of data.

2. The current regex is ad hoc.

Because the data I was working with was in the New York region, the regex I used is specific to positive latitudes and negative longitudes. I tried to find a workaround, but it would warrant a rewrite of how I approached the problem. (If you have a simpler solution, please comment so I can implement it!)

This is the link we extract our geo-codes from.

The current regex pattern being used to extract the geo-codes is:

@\d{1,3}\.\d{5,},-\d{1,3}\.\d{5,}

The Regex for separating the latitude and longitude is:

Latitude: (?<=@)\d{1,3}\.\d{5,}
Longitude: (?<=,)-\d{1,3}\.\d{5,}
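
One possible way around the hemisphere restriction, offered as a sketch rather than a tested replacement: make the minus sign optional on both numbers and use capture groups instead of lookbehinds, so a single pattern finds and splits the pair. The function name `extract_lat_lng` and the sample links are made up for illustration, and the pattern also relaxes the five-decimal minimum to "at least one decimal digit".

```python
import re

# "@<lat>,<lng>" with an optional minus sign on either number.
COORDS = re.compile(r"@(-?\d{1,3}\.\d+),(-?\d{1,3}\.\d+)")


def extract_lat_lng(link):
    """Return (latitude, longitude) as floats, or None if the link has none."""
    found = COORDS.search(link)
    if found is None:
        return None
    return float(found.group(1)), float(found.group(2))


# Northern/western hemisphere, like the New York data in this post:
print(extract_lat_lng(
    "https://www.google.com/maps/place/Example/@40.7580123,-73.9855432,17z"))
# (40.7580123, -73.9855432)

# Southern/eastern hemisphere, which the original pattern would miss:
print(extract_lat_lng(
    "https://www.google.com/maps/place/Example/@-33.8568123,151.2153456,17z"))
# (-33.8568123, 151.2153456)
```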

3. Page responsiveness can be buggy.

If you are working with large amounts of data, you may encounter some bugs: some addresses will raise errors or fail to geo-code, yet geo-code successfully on a second pass. The reason is still a mystery to me, but my hunch is that it comes down to the responsiveness of ChromeDriver and my machine.

The current workaround is to save the data that was successfully geo-coded and rerun the function on the addresses that raised an error. It’s not the best, but it will get the job done.
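
That save-and-rerun workaround can be sketched as a small loop. This is only an illustration: `geocode_with_retries` and its `passes` parameter are names made up here, and `geocode` stands for whatever function geo-codes one address and raises an exception when it fails.

```python
def geocode_with_retries(addresses, geocode, passes=2):
    """Geo-code every address, re-running only the failures on later passes.

    Returns (results, still_failed): a dict mapping each successful address
    to its geo-coded output, and a list of addresses that failed every pass.
    """
    results = {}
    pending = list(addresses)
    for _ in range(passes):
        failed = []
        for address in pending:
            try:
                # Keep whatever succeeded so it never has to be redone.
                results[address] = geocode(address)
            except Exception:
                # Do not crash the loop; set the address aside for a rerun.
                failed.append(address)
        pending = failed
        if not pending:
            break
    return results, pending
```

Anything left in the second return value after the last pass is worth checking by hand, since it may be a genuinely bad address rather than a slow page.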

4. Did I forget to say- this is slow?

When testing this function on my “Starbucks locations in New York” data set, I had to get properly formatted addresses, latitudes and longitudes for 600+ stores. I didn’t time it, but this took around 2 1/2 to 3 hours, which works out to roughly 15 to 18 seconds per address.

So, if you are going to be doing this…

And come back to your computer when this is all done. It goes without saying you can’t use this technique for data points which are on the move.

All this aside, the solution is great: relatively simple and flexible to use.

A “Real World Example”: Mapping Starbucks Locations in New York

This data is a subset of the data I used for making my Coffee Stores in NY Tableau dashboard. Because this function is quite slow I only chose to look at Starbucks locations in New York.

Just look at the screen grab of the map around the Theater District in Manhattan: the results speak for themselves.

Check out the dashboard here

Why does Tableau’s geo-coding show fewer results?

It’s no secret: Tableau does not geo-code to the street level, only to the zip code level. To geo-code to the street level, we need to custom geo-code the way we did here, or use an API or geo-coding service to help us.

Despite all the bugs, I like this method because it is a relatively low-tech solution that puts the power of geo-coding into your own hands and machine. The addresses also do not need to be formatted perfectly for Google Maps to respond appropriately.

How would you tackle a problem like this? Let me know in the comments below!

Thanks for Reading!
