No API? No Problem – Mapping Coordinates to ZIP Codes for Free

This article was first published on The Pleasure of Finding Things Out: A blog by James Triveri , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

ZIP Codes are widely used in analytics because they serve as a convenient proxy for geographic location, even though they are not true areal units. In the United States, ZIP Codes are created and managed by the United States Postal Service for the purpose of delivering mail. They represent collections of delivery routes (not enclosed polygons), and because these routes can change over time and don’t necessarily align with geographic or administrative boundaries, ZIP Codes do not have precise, fixed geographic boundaries.

The US Census Bureau releases ZIP Code Tabulation Area (ZCTA) shapefiles, which are generalized areal representations of the geographic extent and distribution of point-based ZIP Codes built using census tabulation blocks. Note however, ZCTAs are not the same as ZIP Codes, and not all valid ZIP Codes are represented by a ZCTA.

Despite these shortcomings, ZIP Codes can still reflect meaningful differences in building stock, emergency services and environmental hazards. They are also typically small and granular enough to allow for regional differentiation, and large enough to avoid some of the volatility seen at finer geographic levels like census block. Additionally, ZIP Codes are commonly used by regulators and in industry datasets, which makes them a practical choice for pricing, underwriting and reporting.

In this article, I’ll walk through two approaches that can be used to assign 5-digit ZIP Code to a sample collection of coordinate pairs without having to resort to a paid API.

OpenAddresses

OpenAddresses is an open data project that provides free, worldwide address data collected from government sources, open datasets, and community contributions. It includes individual address records containing street names, cities, ZIP Codes, longitude and latitude and sometimes administrative areas. The data is organized by country and region and is regularly updated. It is especially useful for geocoding, reverse geocoding, and location enrichment tasks where access to millions of structured address points is required.

For US locations, the data are divided into regions. For this demonstration, our goal is to assign a single zip code (not ZCTA) to 10 longitude latitude pairs falling somewhere in Iowa, so we’ll download us_midwest.zip, which is ~4GB, which expands to about

Within the zip archive, we are interested in the statewide.csv files. For Iowa, the exact path would be something like …/us/ia/statewide.csv. We read this file into a Pandas DataFrame and inspect the first few records:

%load_ext watermark

%watermark --python --conda --hostname --machine --iversions
import warnings

import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

np.set_printoptions(suppress=True, precision=8, linewidth=1000)
pd.options.mode.chained_assignment = None
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
warnings.filterwarnings("ignore")

# Example Iowa statewide file. Data available at https://results.openaddresses.io/
# dfoa = pd.read_csv(
#     "openaddr-collected-us_midwest/us/ia/statewide.csv"
#     r"C:\Users\jtriv\Downloads\collection-us-midwest\us\ia\statewide-addresses-state.geojson"
# )

oa = gpd.read_file(
    r"C:\Users\jtriv\Downloads\collection-us-midwest\us\ia\statewide-addresses-state.geojson"
)

print(f"oa.shape: {oa.shape}")
oa.head(10)

Ultimately a nearest-neighbor lookup will be used to assign a 5-digit ZIP Code to the 10 sample points using LON and LAT associated with known addresses, therefore we drop records with missing values for LON, LAT or POSTCODE and perform a bit of additional clean-up. In the last step, the original OpenAddresses dataset is converted to a GeoDataFrame:

# Clean and filter.
dfoa = dfoa.dropna(subset=["LON", "LAT", "POSTCODE"])
dfoa["POSTCODE"] = dfoa["POSTCODE"].astype(str).str.strip()
dfoa = dfoa[dfoa.POSTCODE!=""].reset_index(drop=True)

# Retain ZIP5 only (drop the plus four code).
dfoa["POSTCODE"] = dfoa["POSTCODE"].astype(str).str.zfill(5)
dfoa["ZIP5"] = dfoa["POSTCODE"].str.split("-", n=1, expand=True)[0]
dfoa = dfoa[["LON", "LAT", "ZIP5"]]

# Create GeoDataFrame.
oa = gpd.GeoDataFrame(
    dfoa,
    geometry=gpd.points_from_xy(dfoa.LON, dfoa.LAT),
    crs="EPSG:4326"
)

print(f"Iowa addresses in OpenAddresses dataset: {len(oa):,}")

oa.head()

We create a sample of 10 coordinate pairs that fall within the Iowa state boundary:

# Sample points of interest. All fall within Iowa state boundary.
pois = pd.DataFrame({
    "id": np.arange(10),
    "lon": [-91.4239, -92.5043, -90.6057, -93.6110, -93.1085,
            -90.6103, -95.8953, -92.4293, -93.5519, -95.0369],
    "lat": [41.9252, 40.6608, 41.5667, 42.0368, 41.3333, 
            41.5861, 41.2213, 42.4960, 41.6058, 40.7294]
    }
)

# Create GeoDataFrame.
pois = gpd.GeoDataFrame(
    pois,
    geometry=gpd.points_from_xy(pois.lon, pois.lat),
    crs="EPSG:4326"
)

pois.head(10)

Let’s visualize the 10 points to get an idea of their distribution across the state:

import folium

f = folium.Figure(width=700, height=500)

lats, lons = pois.lat.tolist(), pois.lon.tolist()
coords = list(zip(lats, lons))

# Center map on central Iowa. 
mid_lon, mid_lat = -93.0977, 41.8780
 
m = folium.Map(
    location=[mid_lat, mid_lon], 
    zoom_start=7
).add_to(f)

for lat, lon in coords:

    folium.CircleMarker(
        location=[lat, lon], 
        radius=5, 
        color="red", 
        fill_color="red", 
        fill=True,
        fill_opacity=1
        ).add_to(m)

m

Within GeoPandas, sjoin_nearest performs a spatial join that matches each geometry in one GeoDataFrame to the nearest geometry in another GeoDataFrame based on geometric distance. Instead of checking for intersection or containment like a standard spatial join, sjoin_nearest finds the closest match in terms of Euclidean distance (in projected coordinates) or great-circle distance if using geographic coordinates.

In our example, each of the 10 points will get assigned the ZIP Code associated with the nearest known address point based on great-circle distance:

# Perform nearest neighbor lookup to known lon-lat zip code.
pois = gpd.sjoin_nearest(
    pois,
    oa[["ZIP5", "geometry"]],
    how="left",
    distance_col="dist_meters"
)

pois.head()

In the resulting join, index_right specifies which row index from the OpenAddresses dataset served as the nearest match to the sample point.

Since our sample data is represented using a geographic coordinate reference system, dist_meters represents the distance to the nearest known address in terms of decimal degrees, not meters (even if we named the column "dist_meters"). If accurate distance measurements to the nearest known address are required, it is necessary to re-project the data to a projected CRS, then re-execute the spatial join:

# Using a projected CRS.
pois = pd.DataFrame({
    "id": np.arange(10),
    "lon": [-91.4239, -92.5043, -90.6057, -93.6110, -93.1085,
            -90.6103, -95.8953, -92.4293, -93.5519, -95.0369],
    "lat": [41.9252, 40.6608, 41.5667, 42.0368, 41.3333, 
            41.5861, 41.2213, 42.4960, 41.6058, 40.7294]
    }
)

# Create GeoDataFrame.
pois = gpd.GeoDataFrame(
    pois,
    geometry=gpd.points_from_xy(pois.lon, pois.lat),
    crs="EPSG:4326"
)

# Re-project to a projected CRS for accurate distance calculations.
pois = pois.to_crs("EPSG:3857")

# Ensure same CRS for OpenAddresses data.
oa = oa.to_crs("EPSG:3857")

pois = gpd.sjoin_nearest(
    pois, 
    oa[["ZIP5", "geometry"]], 
    distance_col="dist_meters"
)

pois.head(10)

For the first sample (id = 0), dist_meters indicates the nearest known address is ~20 meters away.

Finding k-Nearest ZIP Codes with BallTree

sjoin_nearest performs a spatial join by finding the nearest geometry in one GeoDataFrame for each geometry in another using a spatial index. When sjoin_nearest is called, it uses the spatial index to find candidate nearest geometries efficiently, then refines the match using actual geometric distance.

One limitation of using sjoin_nearest is that only a single ZIP Code will be assigned to each sample. There may be cases where knowing the ZIP Codes of the five or ten nearest addresses might be of interest, possibly as an indicator of the homogeneity of ZIP Codes in a given region: If the 10 nearest addresses all share the same ZIP Code, it is likely that the sample point shares that ZIP Code (assuming a relatively small distance to each address).

We can identify and retrieve ZIP Codes associated with the k-nearest addresses and corresponding distances using a BallTree. Within scikit-learn, BallTree implements a fast nearest-neighbor search using a space-partitioning tree. It is not geospatially aware and does not work directly with geometry objects, and instead works with raw coordinate arrays. In the next cell, we create a BallTree instance, and identify the 10 nearest ZIP Codes for each sample in pois. Note that BallTree with haversine distance expects input in radians, not degrees:

from sklearn.neighbors import BallTree

# Convert lon/lat to radians for BallTree when using haversine 
# metric.
# OpenAddresses data.
oa_coords = np.deg2rad(oa[["LON", "LAT"]].values)

# 10 sample data points. 
poi_coords = np.deg2rad(pois[["lon", "lat"]].values)

# Create BallTree instance.
bt = BallTree(oa_coords, metric="haversine")

# Query top 10 neighbors.
dists, indices = bt.query(poi_coords, k=10)

print(f"dists:\n{dists}\n")

print(f"indices:\n{indices}\n")

indices is a 10 x 10 array representing the indices into the original OpenAddresses GeoDataFrame corresponding to the 10 nearest addresses. We can lookup the associated ZIP Codes as follows:

# Lookup ZIP Codes for first sample (row 0)

idx = indices[0]

oa.loc[idx, "ZIP5"].tolist()

For the first point of interest, the 10 nearest ZIP Codes are the same, indicating a degree of homegeneity for that region.

dists is also a 10 x 10 array representing the distances to each of the 10 nearest addresses. When using the haversine metric, the distances will be returned in radians. To get distance in meters, multiply by 6,371,000 (assumes an Earth radius of 6,371km):

# Convert radians to meters.
dists*= 6_371_000  

dists.round(2)

Notice the distances increase from left to right. The nearest neighboring address will be at index 0 for each row, the furthest at the last index.

Let’s attach the ZIP Codes and distances to the original pois GeoDataFrame:

# Add ZIP Codes and distances to pois GeoDataFrame.
pois["zip5_bt"] = [oa.loc[ii, "ZIP5"].tolist() for ii in indices]
pois["dist_bt"] = [dists[ii].tolist() for ii in range(len(dists))]

pois.head(10)

We confirm that the ZIP Code identified using sjoin_nearest (ZIP5) matches the nearest address ZIP Code for each of the 10 points of interest.

As final diagnostic, we can visualize all the points from the OpenAddresses dataset that map to one of our POIs nearest ZIP Codes, along with the convex hull enclosing those points. This will provide additional context missing from a purely distance based diagnostic. We focus on the point with id = 1, and perform the following steps:

  • Filter oa to addresses having ZIP Code = 52537.
  • Combine resulting geometries using .unary_union.
  • Call .convex_hull to compute the convex hull polygon enclosing the result from the previous step.
  • Visualize address points, convex hull and original point of interest.
# Filter oa to addresses having ZIP Code = 52537.
# Repoject to WGS84 for visualization.
oa52537 = oa[oa.ZIP5=="52537"].to_crs("EPSG:4326").reset_index(drop=True)

print(f"Addresses in ZIP Code 52537: {len(oa52537):,}")

oa52537.head(10)
# Get convex hull for ZIP Code = 52537.
cvh = oa52537.unary_union.convex_hull

# Get POI lat and lon.
poi_y = pois[pois.id==1]["lat"].item()
poi_x = pois[pois.id==1]["lon"].item()

# Initialize map.
f = folium.Figure(width=800, height=525)
m = folium.Map(location=[poi_y, poi_x], zoom_start=10).add_to(f)


folium.Marker(
    location=[poi_y, poi_x],
    icon=folium.Icon(color="red"),
    popup=f"id=1: {poi_x} {poi_y}"
    ).add_to(m)

# Add address points to map.
for point in oa52537.geometry:
    folium.CircleMarker(
        location=[point.y, point.x],
        radius=3.5,
        color="#4f9de4",
        fill=True,
        weight=1,
        fill_opacity=0.35
    ).add_to(m)


# Add the convex hull polygon to the map.
folium.Polygon(
    locations=[[lat, lon] for lon, lat in cvh.exterior.coords],
    color="#184771",
    fill=False
).add_to(m)


m

The POI (red marker) is surrounded by many addresses with the same ZIP Code. This provides additional confidence that the assignment is reasonable and relatively accurate.

Conclusion

  • ZCTAs <> ZIP Codes! They are not one-to-one. Some ZIP Codes don’t have ZCTA equivalents. ZCTAs are statistical areas and do not map exactly to USPS ZIP Codes. ZIp Codes represent collections of delivery routes (not enclosed polygons).
  • ZCTAs are updated only with each decennial census, while USPS updates ZIP Codes constantly.
  • ZIP Codes that span multiple counties or regions may be misrepresented in ZCTA form.
  • Even though the OpenAddresses dataset contains over 500 million addresses, coverage is variable, and is generally better in urban/suburban areas than in rural areas. Some large metropolitan areas may be missing address data altogether.
  • If you’re building something regulatory or production-facing, you should disclose that the USPS does not publish ZIP Code boundaries, and this method uses nearest inferred ZIP Codes from publicly available data.
To leave a comment for the author, please follow the link and comment on their blog: The Pleasure of Finding Things Out: A blog by James Triveri .

Want to share your content on python-bloggers? click here.