Ridgeline Plots: The Perfect Way to Visualize Data Distributions with Python

Posted on December 1, 2020 by Dario Radečić in Data science | 0 Comments

This article was first published on python – Better Data Science and kindly contributed to python-bloggers.

Aren’t you tired of drawing histograms or density plots for every variable segment? There’s an easier solution. Ridgeline plots are a go-to visualization for this type of problem. Yes, even for multiple variables at the same time.

Here’s what you’ll make today:

Image 1 – What you’ll make today (image by author)

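Let’s get straight to business. Here’s how the article is structured:

Dataset loading and preparation
Ridgeline plot for a single variable
Ridgeline plot for multiple variables
Conclusion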

Dataset loading and preparation

The dataset you’ll use today is called Rain in Australia, so please download it. You won’t use it to predict rain, as its description suggests, but to make visualizations.

You’ll use only four columns:

Date – useful for extracting month information
Location – you’ll work only with Sydney data
MinTemp – minimum temperature for the day
MaxTemp – maximum temperature for the day

Before loading the dataset, there’s one library you need to install – joypy. It is used to make joyplots, also known as ridgeline plots, in Python:

pip install joypy

Here’s how to load in the dataset. Keep in mind that you only want the four mentioned columns:
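
A minimal sketch of the loading step. It assumes the downloaded file is saved as weatherAUS.csv in the working directory; that file name and the helper name load_weather are assumptions, so adjust them to your setup. The usecols argument tells pandas to read only the four columns listed above:

```python
import pandas as pd

# The four columns used in this article
COLUMNS = ["Date", "Location", "MinTemp", "MaxTemp"]


def load_weather(path="weatherAUS.csv"):
    """Read only the needed columns from the CSV (the default path is an assumed file name)."""
    return pd.read_csv(path, usecols=COLUMNS)


# Usage, once the file is downloaded:
# df = load_weather()
# df.head()
```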

The first couple of rows should look like this:

Image 2 – Head of Rain in Australia dataset (image by author)

Onto the preparation now. The to-do list is quite short:

Create a data frame sydney that contains only the Sydney data
Ditch the Location column
Convert Date column to datetime64 type
Extract month names from the date

Here’s the code:
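
One way to write these four steps with pandas is sketched below. The function name prepare_sydney is an assumption made for illustration, and df is assumed to be the data frame with the four columns loaded earlier:

```python
import pandas as pd


def prepare_sydney(df):
    """Return Sydney-only rows with a datetime Date column and a Month name column."""
    # Keep only the Sydney rows; copy so the original frame is left untouched
    sydney = df[df["Location"] == "Sydney"].copy()
    # Location is constant after filtering, so drop it
    sydney = sydney.drop(columns="Location")
    # Parse the date strings into datetime64 values
    sydney["Date"] = pd.to_datetime(sydney["Date"])
    # Extract full month names, e.g. "December"
    sydney["Month"] = sydney["Date"].dt.month_name()
    return sydney


# Usage:
# sydney = prepare_sydney(df)
# sydney.head()
```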

The dataset now looks like this:

Image 3 – Head of Rain in Australia dataset after modifications (image by author)

It’s starting to look good, but you’re not done yet. The dataset isn’t aware of the relationship between the months. As a result, ordering them on a chart is a nightmare.

Pandas has a CategoricalDtype class that can help you with this. You have to specify the ordering of the categories and make the conversion afterward. Here’s how:
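
A sketch of that conversion, assuming sydney already has a Month column holding English month names. The helper name order_months and the choice to mark the categories as ordered are assumptions of this example:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Month names in calendar order
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def order_months(sydney):
    """Return a copy whose Month column is a categorical in calendar order."""
    month_type = CategoricalDtype(categories=MONTHS, ordered=True)
    out = sydney.copy()
    out["Month"] = out["Month"].astype(month_type)
    return out


# Usage:
# sydney = order_months(sydney)
# sydney.dtypes  # Month should now be listed as category
```

With the categories declared in this order, sorting or grouping by Month follows the calendar instead of the alphabet.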

Checking the dtypes attribute confirms that the conversion was successful:

Image 4 – Dataset data types (image by author)

You’re done, preparation-wise! Time to make some ridgeline plots.

Ridgeline plot for a single variable

Drawing a chart boils down to a single function call. Here’s the code you’ll need to make a ridgeline plot of maximum temperatures in Sydney:

If you don’t care about the title, you can skip the surrounding matplotlib calls. A call to joyplot() alone is enough.

Here’s what the visualization looks like:

Image 5 – Ridgeline plot for max temperatures in Sydney (image by author)

It took me a moment to realize nothing is wrong with the visualization, even though the warmest months are December through February. The dataset contains temperature data for Australia, where the seasons are the opposite of those in the northern hemisphere.

Let’s see how to make things more complex by introducing a second variable to the plot.

Ridgeline plot for multiple variables

In addition to plotting distributions for max temperatures, you’ll now include the min temperatures. Once again, the joypy library makes it easy:

Here’s what the visualization looks like:

Image 6 – Ridgeline plot for min and max temperatures in Sydney (image by author)

Take a moment to appreciate how much information is shown on this single chart. The most naive approach would take 24 separate density plots (12 months times 2 variables), and comparisons wouldn’t be nearly as easy.

Let’s wrap things up next.

Conclusion

And that’s ridgeline plots in a nutshell. You could do more – like coloring the area under the curve by some variable. The official documentation is packed with examples – explore it if you have the time.

To summarize – use ridgeline plots whenever you need to visualize distributions of variables and their segments in a compact way. Drawing histograms and density plots manually for variable segments is something you should avoid.
