Working with Geospatial Vector Data

Posted on August 17, 2024 by The Pleasure of Finding Things Out: A blog by James Triveri in Data science | 0 Comments

This article was first published on The Pleasure of Finding Things Out: A blog by James Triveri , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Geospatial vector data is a way of representing geographic features in a digital format using points, lines, and polygons. Unlike raster data, which represents geographic data as a grid of cells or pixels, vector data represents features more precisely with distinct shapes and boundaries. Each vector feature can have associated attributes, such as names, types, or other descriptive information.

Types of Geospatial Vector Data

Points: Represent discrete locations such as cities, landmarks, or individual trees. Each point has a specific location defined by coordinates (e.g., latitude and longitude).
Lines (or polylines): Represent linear features such as roads, rivers, or boundaries. Lines are composed of a series of connected points.
Polygons (or multipolygons): Represent areas or shapes such as lakes, parks, or country borders. Polygons are defined by a series of points that create a closed shape.

Shapefiles are one of the most common formats for vector data. They store points, lines, and polygons along with attribute information. The US Census Bureau makes a number of shapefiles available here. In this notebook, we’ll walkthrough how to load shapefiles into GeoPandas, plotting the boundaries and create a choropleth map based on a second dataset (choropleth maps are those where the color of each shape is based on the value of an associated variable).

To start, download US state shapefile cb_2018_us_state_500k.zip from the United States Census Bureau boundary files page. Under the State subheader, you will see three files:

cb_2018_us_state_500k.zip
cb_2018_us_state_5m.zip
cb_2018_us_state_20m.zip

The 500k files are the most detailed, but also the largest. The 20m files are the smallest, but at the cost of some dramatic simplification. The 5m files fall somewhere between the other two. We will work with the 500k files.

Once downloaded, the shapefile can be loaded into a GeoPandas DataFrame as follows:

import numpy as np
import pandas as pd
import geopandas as gpd

shp_path = "cb_2018_us_state_500k.zip"

dfshp = gpd.read_file(shp_path)

dfshp.head(5)

	STATEFP	STATENS	AFFGEOID	GEOID	STUSPS	NAME	ALAND	AWATER	geometry
0	28	01779790	0400000US28	28	MS	Mississippi	121533519481	3926919758	MULTIPOLYGON (((-88.50297 30.21523, -88.49176 …
1	37	01027616	0400000US37	37	NC	North Carolina	125923656064	13466071395	MULTIPOLYGON (((-75.72681 35.93584, -75.71827 …
2	40	01102857	0400000US40	40	OK	Oklahoma	177662925723	3374587997	POLYGON ((-103.00257 36.52659, -103.00219 36.6…
3	51	01779803	0400000US51	51	VA	Virginia	102257717110	8528531774	MULTIPOLYGON (((-75.74241 37.80835, -75.74151 …
4	54	01779805	0400000US54	54	WV	West Virginia	62266474513	489028543	POLYGON ((-82.64320 38.16909, -82.64300 38.169…

The geometry column is a special column in a GeoDataFrame that stores the geometric shapes associated with each row (in this case, the shapes in latitude-longitude pairs that define the boundary of each state). This column contains the vector data that defines the spatial features in the dataset. Some states have boundaries defined by a MULTIPOLYGON, such as Hawaii, whose boundary consists of multiple closed POLYGONS. If it isn’t already present, the geometry column needs to be defined.

We can plot the data present in the present in the shapefile by calling the GeoDataFrame’s plot method:

dfshp.plot()

Let’s zoom in and focus on a map of the lower 48 states only:

exclude = ["American Samoa", "Alaska", "Hawaii", "Guam", "United States Virgin Islands",
           "Commonwealth of the Northern Mariana Islands", "Puerto Rico"]

dfshp48 = dfshp[~dfshp.NAME.isin(exclude)].reset_index(drop=True)

dfshp48.plot()

We can get a better view of the boundaries of each state by calling boundary.plot:

dfshp48.boundary.plot()

By default, the plots rendered via GeoPandas are smaller than we might like. We can increase the size of the rendered map, suppress ticklabels, change the boundary color and add a title as follows:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)

ax.set_title("U.S. Boundaries - Lower 48 States")
dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50)

ax.axis("off")

plt.show()

To overlay the state name at the center of each state, use:

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("U.S. Boundaries - Lower 48 States")
dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50)
dfshp48.apply(lambda x: ax.annotate(x.NAME, xy=x.geometry.centroid.coords[0], ha='center', fontsize=6), axis=1)
ax.axis("off")
plt.show()

In the shapefile, ALAND and AWATER represent the land and water area of each state in square meters. To create a choropleth map based on the natural log of AWATER, include the column argument to the plot method:

# Compute natural log of AWATER to get better separation by state.
dfshp48["log_AWATER"] = np.log(dfshp48["AWATER"])
dfshp48.plot(column="log_AWATER", cmap="plasma")

We can reformat the map as before, while also adding a legend to give context the difference in colors by state. Options for colormaps are available here:

fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("Ln(AWATER) - Lower 48 States")
dfshp48.plot(
    ax=ax, column="log_AWATER", edgecolor="gray", linewidth=.50, 
    cmap="gist_rainbow", alpha=.750, legend=True,
    legend_kwds={"label": "Ln(AWATER)", "orientation": "vertical", "shrink": .35}
    )
ax.axis("off")
plt.show()

Congressional District Shapefiles

For variety, let’s download the Congressional District shapefile and plot the boundaries. It is available at the same link as above, and is identified as cb_2018_us_cd116_500k.zip. Reading the file into GeoPandas and displaying the first 5 rows yields:

dfc = gpd.read_file("cb_2018_us_cd116_500k.zip")

print(f"dfc.shape: {dfc.shape}")

dfc.head(5)

dfc.shape: (441, 9)

	STATEFP	CD116FP	AFFGEOID	GEOID	LSAD	CDSESSN	ALAND	AWATER	geometry
0	17	10	5001600US1710	1710	C2	116	777404163	31605644	POLYGON ((-88.19882 42.41557, -88.19860 42.415…
1	47	06	5001600US4706	4706	C2	116	16770155959	324676580	POLYGON ((-87.15023 36.56770, -87.14962 36.568…
2	48	06	5001600US4806	4806	C2	116	5564805243	255530191	POLYGON ((-97.38860 32.61731, -97.38856 32.618…
3	48	07	5001600US4807	4807	C2	116	419784487	3069802	POLYGON ((-95.77383 29.87515, -95.76962 29.875…
4	48	26	5001600US4826	4826	C2	116	2349987793	191353567	POLYGON ((-97.39826 32.99996, -97.39792 33.013…

We again display the boundaries:

dfc.boundary.plot()

We’d like to focus on the lower 48 states again, but this time the shapefile doesn’t have a NAME column. How should we proceed?

One approach is to define a bounding box that encloses the lower 48 states, then filter the shapefile to retain only those congressional districts whose geometry intersects the bounding box. GeoPandas provides coordinate based indexing with the cx indexer, which slices using a bounding box. Geometries in the GeoSeries or GeoDataFrame that intersect the bounding box will be returned.

For the lower 48 states bounding box, we’ll use (-125, 24.6), (-65, 50), southwest to northeast. We also include a circle marker at the center of each congressional district:

xmin, ymin, xmax, ymax = -125, 24.6, -65, 50
dfc48 = dfc.cx[xmin:xmax, ymin:ymax]



fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True)
ax.set_title("US Congressional Districts, 116th Congress - Lower 48 States", fontsize=11)
dfc48.boundary.plot(ax=ax, edgecolor="black", linewidth=.50)
dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")
ax.axis("off")
plt.show()

C:\Users\jtriv\AppData\Local\Temp\ipykernel_8996\3296541533.py:9: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")

GeoJSON

Working with GeoJSON is much the same as working with shapefiles, one difference being that with GeoJSON, vector data is contained within a single file as opposed to an archive of multiple file types. See here for an example.

But once read into GeoPandas, we work with it the same way. We can load US state boundary files as GeoJSON from GitHub via:

dfstate = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json")

dfstate.head()

	id	name	density	geometry
0	01	Alabama	94.650	POLYGON ((-87.35930 35.00118, -85.60667 34.984…
1	02	Alaska	1.264	MULTIPOLYGON (((-131.60202 55.11798, -131.5691…
2	04	Arizona	57.050	POLYGON ((-109.04250 37.00026, -109.04798 31.3…
3	05	Arkansas	56.430	POLYGON ((-94.47384 36.50186, -90.15254 36.496…
4	06	California	241.700	POLYGON ((-123.23326 42.00619, -122.37885 42.0…

dfstate.plot()

To leave a comment for the author, please follow the link and comment on their blog: The Pleasure of Finding Things Out: A blog by James Triveri .

Want to share your content on python-bloggers? click here.

Python-bloggers

Data science news and tutorials - contributed by Python bloggers

Working with Geospatial Vector Data

Types of Geospatial Vector Data

Congressional District Shapefiles

GeoJSON

Related