Python’s Pandas vs. R’s dplyr – Which Is The Best Data Analysis Library

Dario Radečić

5 years ago

This article was first published on python – Appsilon | End to End Data Science Solutions, and kindly contributed to python-bloggers.


It’s difficult to find the ultimate go-to library for data analysis. Both R and Python provide excellent options, so the question quickly becomes “which data analysis library is the most convenient”. Today’s article aims to answer this question, assuming you’re equally skilled in both languages.

Looking for more Python and R comparisons? Check out our Python Dash vs. R Shiny comparison.

If you’re not equally skilled in both languages, your decision will likely be biased, as people tend to favor the technologies they already know. We’ll try to give an unbiased opinion based on code comparisons.

The article is structured as follows: Data Loading, Filtering, Summary Statistics, Creating Derived Columns, and Conclusion.

Data Loading

There’s no data analysis without data. Both Pandas and dplyr can read from most common data sources and file formats. That’s why we won’t spend any time exploring connection options and will use the ready-made Gapminder dataset instead.

Here’s how you can load the Pandas library and the Gapminder dataset in Python:
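A minimal sketch of the loading step, assuming the third-party gapminder Python package is installed and exposes a data frame named gapminder (an assumption; the package is not part of Pandas). If the import fails, the sketch falls back to a small hand-made frame with the same column names and placeholder values, not real Gapminder figures.

```python
import pandas as pd

try:
    # Assumption: the third-party `gapminder` package exposes the dataset
    # as a DataFrame named `gapminder`.
    from gapminder import gapminder
except ImportError:
    # Fallback: hypothetical stand-in rows with made-up round values.
    gapminder = pd.DataFrame({
        "country": ["United States", "United States", "Brazil", "France", "Japan"],
        "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
        "year": [2002, 2007, 2007, 2007, 2007],
        "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
        "pop": [100, 200, 300, 50, 150],
        "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
    })

# Show the first rows to confirm the data is loaded.
print(gapminder.head())
```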

The results are shown below:

Image 1 – Library and dataset loading with Pandas

And here’s how you can do the same with R and dplyr:

Here are the results:

Image 2 – Library and dataset loading with dplyr

There’s no winner in this department, as the syntax is nearly identical in both libraries.

Winner – tie.

Filtering

This is where things get a bit more interesting. The dplyr package is well-known for its pipe operator (%>%), which you can use to chain operations. This operator makes data drill-downs easy both to write and to read. On the other hand, Pandas doesn’t have such an operator, although its methods can be chained with dot notation.

Let’s go through three problem sets and see how both libraries compare.

Problem 1 – find records for the most recent year (2007).

Here’s how to do so with Pandas:
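A minimal sketch of the bracket-notation filter, assuming a small hand-made frame named gapminder that borrows Gapminder’s column names. The values are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# The comparison yields a boolean Series; indexing with it keeps matching rows.
recent = gapminder[gapminder["year"] == 2007]
print(recent)
```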

Image 3 – Records from 2007 (Pandas)

And here’s how to do the same with dplyr:

Image 4 – Records from 2007 (dplyr)

As you can see, both libraries are nearly equal when it comes to simple filtering. It’s common to use the filter() function with dplyr and bracket notation with Pandas. There are other options, sure, but you’ll see these most commonly.

Problem 2 – find records from the most recent year (2007) only for the Americas (North and South America).

Still a pretty simple task, but let’s see the differences in code. Pandas comes first:
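A minimal sketch of a two-condition filter, assuming the same kind of hand-made stand-in frame (placeholder values) and assuming that North and South America share the single continent label "Americas", as they do in Gapminder.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Each condition needs its own parentheses because & binds tighter than ==.
americas_2007 = gapminder[
    (gapminder["year"] == 2007) & (gapminder["continent"] == "Americas")
]
print(americas_2007)
```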

Image 5 – Records from 2007 for the Americas (Pandas)

And here’s how to do the same with dplyr:

Image 6 – Records from 2007 for the Americas (dplyr)

Applying multiple filters is much easier with dplyr than with Pandas. You can separate conditions with a comma inside a single filter() function. Pandas requires you to wrap each condition in parentheses and join them with the & operator, which means more typing and code that’s harder to read.

Problem 3 – find records from the most recent year (2007) only for the United States.

Let’s add yet another filter condition. The Pandas library comes first:
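A minimal sketch with three conditions, one per line, again on a hand-made stand-in frame with placeholder values rather than the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Putting one condition per line keeps a longer filter readable.
us_2007 = gapminder[
    (gapminder["year"] == 2007)
    & (gapminder["continent"] == "Americas")
    & (gapminder["country"] == "United States")
]
print(us_2007)
```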

Image 7 – Records from 2007 for the United States (Pandas)

And here’s how to do the same with dplyr:

Image 8 – Records from 2007 for the United States (dplyr)

In a nutshell, Pandas is still tough to write, but you can put every filter condition on a separate line so it’s easier to read.

Winner – dplyr. Filtering is more intuitive and easier to read.

Summary Statistics

One of the most common data analysis tasks is calculating summary statistics, such as the sample mean. This section compares Pandas and dplyr for these tasks through three problem sets.

Problem 1 – calculate the average (mean) life expectancy worldwide in 2007.

It sounds like a trivial problem – and it is. Let’s see how Pandas handles it.
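A minimal sketch of the mean calculation, assuming a hand-made stand-in frame. Its values are placeholders, so the printed mean says nothing about real life expectancy.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Select the 2007 rows and the lifeExp column in one step, then average.
mean_life_exp = gapminder.loc[gapminder["year"] == 2007, "lifeExp"].mean()
print(mean_life_exp)
```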

Image 9 – Average life expectancy worldwide in 2007 (Pandas)

Let’s do the same with dplyr:

Image 10 – Average life expectancy worldwide in 2007 (dplyr)

As you can see, dplyr uses the summarize() function to calculate summary statistics, and Pandas relies on calling the function on the column(s) of interest.

Problem 2 – calculate the average (mean) life expectancy in 2007 for every continent.

A bit trickier problem, but nothing you can’t handle. The solution requires a group-by operation on the column of interest. Here’s how to do the calculation with Pandas:
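A minimal sketch of the grouped mean, assuming a hand-made stand-in frame with placeholder values. The filter and groupby() are chained inside parentheses so each step sits on its own line.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Filter to 2007, group by continent, then average lifeExp within each group.
mean_by_continent = (
    gapminder[gapminder["year"] == 2007]
    .groupby("continent")["lifeExp"]
    .mean()
)
print(mean_by_continent)
```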

Image 11 – Average life expectancy per continent in 2007 (Pandas)

Let’s do the same with dplyr:

Image 12 – Average life expectancy per continent in 2007 (dplyr)

As you can see, both libraries use a grouping function – groupby() with Pandas, and group_by() with dplyr – which results in similar-looking syntax.

Problem 3 – calculate the total population per continent in 2007 and sort the results in descending order.

Yet another relatively simple task to do. Let’s see how to solve it with Pandas first:
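A minimal sketch of the grouped sum with descending sort, assuming a hand-made stand-in frame with placeholder values. Note that the pop column has to be accessed with brackets, because gapminder.pop refers to a DataFrame method.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Sum population per continent for 2007, largest total first.
pop_by_continent = (
    gapminder[gapminder["year"] == 2007]
    .groupby("continent")["pop"]
    .sum()
    .sort_values(ascending=False)
)
print(pop_by_continent)
```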

Image 13 – Total population per continent in 2007 (Pandas)

Let’s do the same with dplyr:

Image 14 – Total population per continent in 2007 (dplyr)

The sorting was the only new part of this problem. Pandas uses the sort_values() function with an optional ascending argument, while dplyr uses the arrange() function.

Winner – tie. Pandas seems to be a bit more cluttered, but that’s due to the initial filtering. Calculating summary statistics in both is easy.

Creating Derived Columns

This is the last series of tasks in today’s comparison. We’ll explore how easy it is to do feature engineering in both libraries. There are only two problem sets this time.

Problem 1 – calculate the total GDP by multiplying population and GDP per capita.

This should be easy enough to do. Let’s see the Pandas implementation first:
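A minimal sketch of the derived column, assuming a hand-made stand-in frame with placeholder values. The column name totalGDP is chosen here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Element-wise multiplication of two columns creates the new column.
# `totalGDP` is an illustrative name, not taken from the dataset.
gapminder["totalGDP"] = gapminder["pop"] * gapminder["gdpPercap"]

# head() only limits the printout; it is not part of the calculation.
print(gapminder.head())
```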

Image 15 – Calculating total country GDP (Pandas)

And now let’s do the same with dplyr:

Image 16 – Calculating total country GDP (dplyr)

A call to the head() function in Pandas isn’t a part of the solution, but is here only to print the first couple of rows instead of the entire dataset. Implementation in both was straightforward, to say the least.

Problem 2 – print the top ten countries in the 90th percentile with regard to GDP per capita.

This one is a bit trickier, but nothing you can’t handle. Let’s see how to solve it with Pandas:
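A minimal sketch of one way to approach this, under two assumptions that the problem statement leaves open: the ranking is restricted to 2007 records, and “in the 90th percentile” means a GDP per capita at or above the 0.9 quantile. The frame is a hand-made stand-in with placeholder values.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data (made-up round values).
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Brazil", "France", "Japan"],
    "continent": ["Americas", "Americas", "Americas", "Europe", "Asia"],
    "year": [2002, 2007, 2007, 2007, 2007],
    "lifeExp": [70.0, 75.0, 65.0, 80.0, 80.0],
    "pop": [100, 200, 300, 50, 150],
    "gdpPercap": [1000.0, 2000.0, 500.0, 3000.0, 4000.0],
})

# Assumption: work on 2007 records only, in a separate frame for convenience.
df_2007 = gapminder[gapminder["year"] == 2007]

# Assumption: the cutoff is the 0.9 quantile of GDP per capita.
threshold = df_2007["gdpPercap"].quantile(0.9)

# Keep rows at or above the cutoff, then take up to ten with the highest values.
top10 = df_2007[df_2007["gdpPercap"] >= threshold].nlargest(10, "gdpPercap")
print(top10[["country", "gdpPercap"]])
```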

Image 17 – Top 10 countries in the 90th percentile wrt GDP per capita (Pandas)

And now let’s do the same with dplyr:

Image 18 – Top 10 countries in the 90th percentile wrt GDP per capita (dplyr)

We’ve created an additional data frame in Pandas for convenience’s sake. Still, the implementation in dplyr is much simpler and easier to read, making R’s dplyr the winner of this section.

Winner – dplyr. The syntax is much cleaner and easier to read.

Conclusion

According to the tests in this article, dplyr is a clear winner. Does that mean you should abandon Pandas once and for all? Well, no.

How well you’ll solve data analysis tasks depends largely on your familiarity with the library. If you’re a big-time Pandas user, solving tasks with dplyr might seem unnatural, and you’ll spend more time on each task. Use the library that’ll save you the most time.

If you’re equally skilled in both, there’s virtually no debate on which is “better”: dplyr came out ahead in every section that wasn’t a tie.
