Ridgeline Plots: The Perfect Way to Visualize Data Distributions with Python

This article was first published on python – Better Data Science, and kindly contributed to python-bloggers.

Aren’t you tired of drawing histograms or density plots for every variable segment? There’s an easier solution. Ridgeline plots are a go-to visualization for this type of problem. Yes, even for multiple variables at the same time.

Here’s what you’ll make today:

Image 1 – What you’ll make today (image by author)

Let’s get straight to business. Here’s how the article is structured:

  • Dataset loading and preparation
  • Ridgeline plot for a single variable
  • Ridgeline plot for multiple variables
  • Conclusion

Dataset loading and preparation

The dataset you’ll use today is called Rain in Australia, so please download it. You won’t use it to predict rain, as its description suggests, but to make visualizations.

You’ll use only four columns:

  • Date – useful for extracting month information
  • Location – you’ll work only with Sydney data
  • MinTemp – minimum temperature for the day
  • MaxTemp – maximum temperature for the day

Before proceeding to dataset loading, there’s one library you need to install – joypy. It is used to make joyplots or ridgeline plots in Python:

pip install joypy

Here’s how to load in the dataset. Keep in mind that you only want the four mentioned columns:
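
A minimal sketch of the loading step. It assumes the downloaded file is saved as weatherAUS.csv in the working directory; that file name and the helper name load_weather are assumptions, so adjust them to your setup:

```python
import pandas as pd

# Assumption: the downloaded CSV is saved under this name in the working directory.
CSV_PATH = "weatherAUS.csv"

# The four columns used in this article.
COLUMNS = ["Date", "Location", "MinTemp", "MaxTemp"]


def load_weather(source=CSV_PATH):
    """Read only the four needed columns from a CSV path or file-like object."""
    return pd.read_csv(source, usecols=COLUMNS)
```

Calling df = load_weather() followed by df.head() shows the first five rows. Passing usecols keeps the other columns from being loaded into memory at all.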

The first couple of rows should look like this:

Image 2 – Head of Rain in Australia dataset (image by author)

Onto the preparation now. The to-do list is quite short:

  • Create a data frame named sydney that contains only the Sydney rows
  • Ditch the Location column
  • Convert Date column to datetime64 type
  • Extract month names from the date

Here’s the code:
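
A minimal sketch of these four steps, assuming the raw data frame has the Date, Location, MinTemp, and MaxTemp columns described earlier. The helper name prepare_sydney and the new column name Month are assumptions:

```python
import pandas as pd


def prepare_sydney(df):
    """Keep Sydney rows, drop Location, parse dates, and add month names."""
    # Filter to Sydney; .copy() avoids modifying a view of the original frame.
    sydney = df[df["Location"] == "Sydney"].copy()
    # Location is constant after filtering, so it carries no information.
    sydney = sydney.drop(columns="Location")
    # Convert the Date strings to a datetime64 column.
    sydney["Date"] = pd.to_datetime(sydney["Date"])
    # Extract full month names such as "January" (column name is an assumption).
    sydney["Month"] = sydney["Date"].dt.month_name()
    return sydney
```

With the loaded data in df, sydney = prepare_sydney(df) gives the data frame used in the rest of the article.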

The dataset now looks like this:

Image 3 – Head of Rain in Australia dataset after modifications (image by author)

It’s starting to look good, but you’re not done yet. The month names are plain strings, so the dataset isn’t aware of the order of the months. As a result, ordering them on a chart is a nightmare.

Pandas has a CategoricalDtype class that can help you with this. You have to specify the ordering of the categories and make the conversion afterward. Here’s how:
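
A minimal sketch of the conversion, assuming the month names live in a column called Month (an assumed name) and are spelled out in English. The helper name order_months is hypothetical:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Calendar order of the months; this list defines the category ordering.
MONTH_ORDER = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_dtype = CategoricalDtype(categories=MONTH_ORDER, ordered=True)


def order_months(sydney):
    """Return a copy whose Month column is an ordered categorical."""
    out = sydney.copy()
    out["Month"] = out["Month"].astype(month_dtype)
    return out
```

After sydney = order_months(sydney), sorting or grouping by Month follows the calendar instead of the alphabet.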

Checking the dtypes attribute confirms that the transformation was successful:

Image 4 – Dataset data types (image by author)

You’re done, preparation-wise! Time to make some ridgeline plots.

Ridgeline plot for a single variable

Drawing a chart boils down to a single function call. Here’s the code you’ll need to make a ridgeline plot of maximum temperatures in Sydney:

If you don’t care about the title, you can skip everything except the call to joyplot().

Here’s what the visualization looks like:

Image 5 – Ridgeline plot for max temperatures in Sydney (image by author)

It took me a moment to realize nothing is wrong with the visualization. The dataset contains temperature data for Australia, and seasons in the Southern Hemisphere are the reverse of those in the Northern Hemisphere, so the warmest months run from December to February.

Let’s see how to make things more complex by introducing a second variable to the plot.

Ridgeline plot for multiple variables

In addition to plotting distributions for max temperatures, you’ll now include min temperatures. Once again, the joypy library makes it easy:

Here’s what the visualization looks like:

Image 6 – Ridgeline plot for min and max temperatures in Sydney (image by author)

Take a moment to appreciate how much information is shown on this single chart. The most naive approach would take 24 separate density plots (12 months × 2 variables), and comparisons wouldn’t be nearly as easy.

Let’s wrap things up next.

Conclusion

And that’s ridgeline plots in a nutshell. You could do more – like coloring the area under the curve by some variable. The official documentation is packed with examples – explore it if you have the time.

To summarize – use ridgeline plots whenever you need to visualize distributions of variables and their segments in a compact way. Drawing histograms and density plots manually for variable segments is something you should avoid.
