Ridgeline Plots: The Perfect Way to Visualize Data Distributions with Python
Want to share your content on python-bloggers? click here.
Aren’t you tired of drawing histograms or density plots for every variable segment? There’s an easier solution. Ridgeline plots are a go-to visualization for this type of problem. Yes, even for multiple variables at the same time.
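By the end of the article, you’ll have a ridgeline plot of the minimum and maximum temperatures in Sydney, with one ridge per month.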
Let’s get straight to business. Here’s how the article is structured:
- Dataset loading and preparation
- Ridgeline plot for a single variable
- Ridgeline plot for multiple variables
- Conclusion
Dataset loading and preparation
The dataset you’ll use today is called Rain in Australia, so please download it. Its description frames it as a rain prediction task, but here you’ll use it only to make visualizations.
You’ll use only four columns:
- Date: useful for extracting month information
- Location: you’ll work only with Sydney data
- MinTemp: minimum temperature for the day
- MaxTemp: maximum temperature for the day
Before proceeding to dataset loading, there’s one library you need to install: joypy. It is used to make joyplots, also known as ridgeline plots, in Python:
pip install joypy
Here’s how to load in the dataset. Keep in mind that you only want the four mentioned columns:
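The loading step can be sketched roughly as follows. It assumes the Kaggle download is saved as weatherAUS.csv in your working directory, so adjust the path if yours differs. The helper name load_weather is made up for this sketch, and the three inline rows are invented so the snippet runs without the file:

```python
import io

import pandas as pd

# Assumption: the downloaded file is named "weatherAUS.csv".
CSV_PATH = "weatherAUS.csv"
COLUMNS = ["Date", "Location", "MinTemp", "MaxTemp"]


def load_weather(source=CSV_PATH):
    """Read only the four columns used in this article (hypothetical helper)."""
    return pd.read_csv(source, usecols=COLUMNS)


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "Date,Location,MinTemp,MaxTemp,Rainfall\n"
    "2008-02-01,Sydney,19.0,22.0,1.0\n"
    "2008-02-02,Sydney,18.5,25.0,0.0\n"
    "2008-02-01,Albury,13.0,23.0,0.5\n"
)
df = load_weather(sample)
print(df.head())
```

With the real file in place, df = load_weather() followed by df.head() does the same thing. The usecols argument keeps memory usage down because the other columns are never read.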
The first couple of rows should look like this:
Onto the preparation now. The to-do list is quite short:
- Create a data frame sydney which has data only for Sydney
- Ditch the Location column
- Convert the Date column to the datetime64 type
- Extract month names from the date
Here’s the code:
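The four steps above could look something like this. The function name prepare_sydney and the sample rows are made up for illustration; on the real data you would pass in the loaded data frame instead:

```python
import pandas as pd


def prepare_sydney(df):
    """Apply the four preparation steps (hypothetical helper)."""
    # 1. Keep only the Sydney rows; .copy() avoids chained-assignment warnings
    sydney = df[df["Location"] == "Sydney"].copy()
    # 2. Location is constant now, so drop it
    sydney = sydney.drop(columns="Location")
    # 3. Parse the date strings into datetime64 values
    sydney["Date"] = pd.to_datetime(sydney["Date"])
    # 4. Extract the month name, e.g. "February"
    sydney["Month"] = sydney["Date"].dt.month_name()
    return sydney


# Invented sample rows standing in for the real dataset.
df = pd.DataFrame({
    "Date": ["2008-02-01", "2008-07-15", "2008-07-15"],
    "Location": ["Sydney", "Sydney", "Albury"],
    "MinTemp": [19.0, 8.0, 2.0],
    "MaxTemp": [22.0, 17.0, 12.0],
})
sydney = prepare_sydney(df)
print(sydney)
```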
The dataset now looks like this:
It’s starting to look good, but you’re not done yet. The month names are plain strings, so the dataset isn’t aware that January comes before February. Left alone, they sort alphabetically, and ordering them on a chart is a nightmare.
Pandas has a CategoricalDtype class that can help you with this. You have to specify the ordering of the categories and make the conversion afterward. Here’s how:
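A minimal sketch of that conversion is shown below. The tiny data frame is invented for illustration; on the real data you would apply the same astype() call to the Month column of sydney:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# An ordered categorical type: January < February < ... < December
month_type = CategoricalDtype(categories=MONTHS, ordered=True)

# Invented sample rows standing in for the prepared Sydney data.
sydney = pd.DataFrame({
    "Month": ["July", "February", "December"],
    "MaxTemp": [17.0, 22.0, 25.0],
})
sydney["Month"] = sydney["Month"].astype(month_type)

print(sydney.dtypes)                # Month is now reported as "category"
print(sydney.sort_values("Month"))  # sorted in calendar order, not alphabetically
```

Listing the months by hand is a deliberate choice here: it keeps the category order explicit and independent of locale settings.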
Accessing the dtypes attribute informs you the transformation was successful:
You’re done, preparation-wise! Time to make some ridgeline plots.
Ridgeline plot for a single variable
Drawing a chart boils down to a single function call. Here’s the code you’ll need to make a ridgeline plot of maximum temperatures in Sydney:
You could ditch the first and last two lines if you don’t care about the title. A call to joyplot() is enough.
Here’s what the visualization looks like:
It took me a moment to realize nothing is wrong with the visualization. The dataset contains temperature data for Australia, where the seasons are the reverse of those in the northern hemisphere, so the warmest months are December through February.
Let’s see how to make things more complex by introducing a second variable to the plot.
Ridgeline plot for multiple variables
In addition to plotting distributions for max temperatures, you’ll now include the min temperature. Once again, the joypy library makes it easy:
Here’s what the visualization looks like:
Take a moment to appreciate how much information is shown on this single chart. The most naive approach would take 24 separate density plots (12 months times 2 variables), and comparisons wouldn’t be nearly as easy.
Let’s wrap things up next.
Conclusion
And that’s ridgeline plots in a nutshell. You could do more – like coloring the area under the curve by some variable. The official documentation is packed with examples – explore it if you have the time.
To summarize – use ridgeline plots whenever you need to visualize distributions of variables and their segments in a compact way. Drawing histograms and density plots manually for variable segments is something you should avoid.
The post Ridgeline Plots: The Perfect Way to Visualize Data Distributions with Python appeared first on Better Data Science.