Working with Geospatial Vector Data
Want to share your content on python-bloggers? click here.
Geospatial vector data is a way of representing geographic features in a digital format using points, lines, and polygons. Unlike raster data, which represents geographic data as a grid of cells or pixels, vector data represents features more precisely with distinct shapes and boundaries. Each vector feature can have associated attributes, such as names, types, or other descriptive information.
Types of Geospatial Vector Data
Points: Represent discrete locations such as cities, landmarks, or individual trees. Each point has a specific location defined by coordinates (e.g., latitude and longitude).
Lines (or polylines): Represent linear features such as roads, rivers, or boundaries. Lines are composed of a series of connected points.
Polygons (or multipolygons): Represent areas or shapes such as lakes, parks, or country borders. Polygons are defined by a series of points that create a closed shape.
Shapefiles are one of the most common formats for vector data. They store points, lines, and polygons along with attribute information. The US Census Bureau makes a number of shapefiles available here. In this notebook, we’ll walkthrough how to load shapefiles into GeoPandas, plotting the boundaries and create a choropleth map based on a second dataset (choropleth maps are those where the color of each shape is based on the value of an associated variable).
To start, download US state shapefile cb_2018_us_state_500k.zip from the United States Census Bureau boundary files page. Under the State subheader, you will see three files:
- cb_2018_us_state_500k.zip
- cb_2018_us_state_5m.zip
- cb_2018_us_state_20m.zip
The 500k files are the most detailed, but also the largest. The 20m files are the smallest, but at the cost of some dramatic simplification. The 5m files fall somewhere between the other two. We will work with the 500k files.
Once downloaded, the shapefile can be loaded into a GeoPandas DataFrame as follows:
import numpy as np import pandas as pd import geopandas as gpd shp_path = "cb_2018_us_state_500k.zip" dfshp = gpd.read_file(shp_path) dfshp.head(5)
STATEFP | STATENS | AFFGEOID | GEOID | STUSPS | NAME | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 28 | 01779790 | 0400000US28 | 28 | MS | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 … |
1 | 37 | 01027616 | 0400000US37 | 37 | NC | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 … |
2 | 40 | 01102857 | 0400000US40 | 40 | OK | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6… |
3 | 51 | 01779803 | 0400000US51 | 51 | VA | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 … |
4 | 54 | 01779805 | 0400000US54 | 54 | WV | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169… |
The geometry column is a special column in a GeoDataFrame that stores the geometric shapes associated with each row (in this case, the shapes in latitude-longitude pairs that define the boundary of each state). This column contains the vector data that defines the spatial features in the dataset. Some states have boundaries defined by a MULTIPOLYGON, such as Hawaii, whose boundary consists of multiple closed POLYGONS. If it isn’t already present, the geometry column needs to be defined.
We can plot the data present in the present in the shapefile by calling the GeoDataFrame’s plot
method:
dfshp.plot()
Let’s zoom in and focus on a map of the lower 48 states only:
exclude = ["American Samoa", "Alaska", "Hawaii", "Guam", "United States Virgin Islands", "Commonwealth of the Northern Mariana Islands", "Puerto Rico"] dfshp48 = dfshp[~dfshp.NAME.isin(exclude)].reset_index(drop=True) dfshp48.plot()
We can get a better view of the boundaries of each state by calling boundary.plot
:
dfshp48.boundary.plot()
By default, the plots rendered via GeoPandas are smaller than we might like. We can increase the size of the rendered map, suppress ticklabels, change the boundary color and add a title as follows:
import matplotlib.pyplot as plt fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True) ax.set_title("U.S. Boundaries - Lower 48 States") dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50) ax.axis("off") plt.show()
To overlay the state name at the center of each state, use:
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True) ax.set_title("U.S. Boundaries - Lower 48 States") dfshp48.boundary.plot(ax=ax, edgecolor="red", linewidth=.50) dfshp48.apply(lambda x: ax.annotate(x.NAME, xy=x.geometry.centroid.coords[0], ha='center', fontsize=6), axis=1) ax.axis("off") plt.show()
In the shapefile, ALAND and AWATER represent the land and water area of each state in square meters. To create a choropleth map based on the natural log of AWATER, include the column
argument to the plot
method:
# Compute natural log of AWATER to get better separation by state. dfshp48["log_AWATER"] = np.log(dfshp48["AWATER"]) dfshp48.plot(column="log_AWATER", cmap="plasma")
We can reformat the map as before, while also adding a legend to give context the difference in colors by state. Options for colormaps are available here:
fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True) ax.set_title("Ln(AWATER) - Lower 48 States") dfshp48.plot( ax=ax, column="log_AWATER", edgecolor="gray", linewidth=.50, cmap="gist_rainbow", alpha=.750, legend=True, legend_kwds={"label": "Ln(AWATER)", "orientation": "vertical", "shrink": .35} ) ax.axis("off") plt.show()
Congressional District Shapefiles
For variety, let’s download the Congressional District shapefile and plot the boundaries. It is available at the same link as above, and is identified as cb_2018_us_cd116_500k.zip. Reading the file into GeoPandas and displaying the first 5 rows yields:
dfc = gpd.read_file("cb_2018_us_cd116_500k.zip") print(f"dfc.shape: {dfc.shape}") dfc.head(5)
dfc.shape: (441, 9)
STATEFP | CD116FP | AFFGEOID | GEOID | LSAD | CDSESSN | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|
0 | 17 | 10 | 5001600US1710 | 1710 | C2 | 116 | 777404163 | 31605644 | POLYGON ((-88.19882 42.41557, -88.19860 42.415… |
1 | 47 | 06 | 5001600US4706 | 4706 | C2 | 116 | 16770155959 | 324676580 | POLYGON ((-87.15023 36.56770, -87.14962 36.568… |
2 | 48 | 06 | 5001600US4806 | 4806 | C2 | 116 | 5564805243 | 255530191 | POLYGON ((-97.38860 32.61731, -97.38856 32.618… |
3 | 48 | 07 | 5001600US4807 | 4807 | C2 | 116 | 419784487 | 3069802 | POLYGON ((-95.77383 29.87515, -95.76962 29.875… |
4 | 48 | 26 | 5001600US4826 | 4826 | C2 | 116 | 2349987793 | 191353567 | POLYGON ((-97.39826 32.99996, -97.39792 33.013… |
We again display the boundaries:
dfc.boundary.plot()
We’d like to focus on the lower 48 states again, but this time the shapefile doesn’t have a NAME column. How should we proceed?
One approach is to define a bounding box that encloses the lower 48 states, then filter the shapefile to retain only those congressional districts whose geometry intersects the bounding box. GeoPandas provides coordinate based indexing with the cx
indexer, which slices using a bounding box. Geometries in the GeoSeries or GeoDataFrame that intersect the bounding box will be returned.
For the lower 48 states bounding box, we’ll use (-125, 24.6), (-65, 50), southwest to northeast. We also include a circle marker at the center of each congressional district:
xmin, ymin, xmax, ymax = -125, 24.6, -65, 50 dfc48 = dfc.cx[xmin:xmax, ymin:ymax] fig, ax = plt.subplots(1, 1, figsize=(10, 8), tight_layout=True) ax.set_title("US Congressional Districts, 116th Congress - Lower 48 States", fontsize=11) dfc48.boundary.plot(ax=ax, edgecolor="black", linewidth=.50) dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red") ax.axis("off") plt.show()
C:\Users\jtriv\AppData\Local\Temp\ipykernel_8996\3296541533.py:9: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation. dfc48.geometry.centroid.plot(ax=ax, markersize=6, color="red")
GeoJSON
Working with GeoJSON is much the same as working with shapefiles, one difference being that with GeoJSON, vector data is contained within a single file as opposed to an archive of multiple file types. See here for an example.
But once read into GeoPandas, we work with it the same way. We can load US state boundary files as GeoJSON from GitHub via:
dfstate = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json") dfstate.head()
id | name | density | geometry | |
---|---|---|---|---|
0 | 01 | Alabama | 94.650 | POLYGON ((-87.35930 35.00118, -85.60667 34.984… |
1 | 02 | Alaska | 1.264 | MULTIPOLYGON (((-131.60202 55.11798, -131.5691… |
2 | 04 | Arizona | 57.050 | POLYGON ((-109.04250 37.00026, -109.04798 31.3… |
3 | 05 | Arkansas | 56.430 | POLYGON ((-94.47384 36.50186, -90.15254 36.496… |
4 | 06 | California | 241.700 | POLYGON ((-123.23326 42.00619, -122.37885 42.0… |
dfstate.plot()
Want to share your content on python-bloggers? click here.