Want to share your content on python-bloggers? click here.
Pandas vs. dplyr
It’s difficult to find the ultimate go-to library for data analysis. Both R and Python provide excellent options, so the question quickly becomes “which data analysis library is the most convenient”. Today’s article aims to answer this question, assuming you’re equally skilled in both languages.
Looking for more Python and R comparisons? Check out our Python Dash vs. R Shiny comparison.
As mentioned earlier, this article assumes you are equally skilled in both R and Python. If that’s not the case, your decision will likely be biased, as people tend to favor the technologies they already know. We’ll try to provide an unbiased opinion based on facts and code comparisons.
The article is structured as follows:
- Data Loading
- Filtering
- Summary Statistics
- Creating Derived Columns
- Conclusion
Data Loading
There’s no data analysis without data. Both Pandas and dplyr can connect to virtually any data source and read from most common file formats. That’s why we won’t spend any time exploring connection options but will use a built-in dataset instead.
Here’s how you can load the Gapminder dataset with Python and Pandas:
The results are shown below:
And here’s how you can do the same with R and dplyr:
Here are the results:
There’s no winner in this department, as both libraries are nearly identical in syntax.
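For readers following along in Python, here is a minimal sketch of the Pandas side. It assumes a small hand-made frame standing in for Gapminder: the column names match the real dataset, but the countries and values are made up for illustration.

```python
import pandas as pd

# Stand-in for Gapminder: real column names, made-up rows (assumption).
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Borduria", "Borduria"],
    "continent": ["Europe", "Europe", "Americas", "Americas"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [70.0, 72.0, 60.0, 62.0],
    "pop": [1_000_000, 1_100_000, 3_000_000, 3_300_000],
    "gdpPercap": [10_000.0, 12_000.0, 5_000.0, 6_000.0],
})

# Peek at the first rows, as one would right after loading a dataset.
first_rows = gapminder.head()
print(first_rows)
print(gapminder.shape)  # (rows, columns)
```

Note that the pop column has to be accessed as gapminder["pop"], because gapminder.pop refers to a DataFrame method.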
Winner – tie.
Filtering
This is where things get a bit more interesting. The dplyr package is well-known for its pipe operator (%>%), which you can use to chain operations. This operator makes data drill-downs easy to both write and read. Pandas, on the other hand, doesn’t have such an operator.
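Pandas has no pipe operator, but method calls can be chained inside parentheses, which gets part of the way there. The sketch below is only an illustration on a made-up toy frame (hypothetical countries and values); it is not the style used in the comparisons in this article.

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Borduria", "Sylvania"],
    "year": [2002, 2007, 2007, 2007],
    "lifeExp": [70.0, 72.0, 62.0, 80.0],
})

# Each step returns a new DataFrame, so calls can be chained top to bottom.
result = (
    gapminder
    .loc[lambda d: d["year"] == 2007]          # keep only 2007 rows
    .sort_values("lifeExp", ascending=False)   # highest life expectancy first
    .head(2)                                   # keep the top two rows
)
print(result)
```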
Let’s go through three problem sets and see how both libraries compare.
Problem 1 – find records for the most recent year (2007).
Here’s how to do so with Pandas:
And here’s how to do the same with dplyr:
As you can see, both libraries are nearly equal when it comes to simple filtering. It’s common to use the filter() function with dplyr and bracket notation with Pandas. There are other options, sure, but you’ll see these most commonly.
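As a rough Pandas sketch of bracket-notation filtering, assuming a made-up toy frame with Gapminder-like column names, the snippet below keeps only the 2007 rows. It also shows query(), one of the "other options" in Pandas, which should return the same rows here.

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Borduria", "Borduria"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [70.0, 72.0, 60.0, 62.0],
})

# Bracket notation: a boolean mask selects the matching rows.
recent = gapminder[gapminder["year"] == 2007]

# Alternative: the same condition written as a query string.
recent_q = gapminder.query("year == 2007")

print(recent)
```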
Problem 2 – find records from the most recent year (2007) only for the Americas (North and South America).
Still a pretty simple task, but let’s see the differences in code. Pandas comes first:
And here’s how to do the same with dplyr:
Applying multiple filters is much easier with dplyr than with Pandas. You can separate conditions with a comma inside a single filter() function. Pandas requires more typing and produces code that’s harder to read.
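To make the extra typing concrete, here is a hedged Pandas sketch with two conditions, again on a made-up toy frame. It assumes the continent is labeled "Americas", as in the Gapminder dataset.

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Borduria", "Borduria", "Sylvania"],
    "continent": ["Europe", "Americas", "Americas", "Americas"],
    "year": [2007, 2002, 2007, 2007],
})

# Each condition needs its own parentheses because & binds more tightly
# than == in Python; the conditions are combined element-wise with &.
americas_2007 = gapminder[
    (gapminder["year"] == 2007) & (gapminder["continent"] == "Americas")
]
print(americas_2007)
```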
Problem 3 – find records from the most recent year (2007) only for the United States.
Let’s add yet another filter condition. The Pandas library comes first:
And here’s how to do the same with dplyr:
In a nutshell, Pandas is still tough to write, but you can put every filter condition on a separate line so it’s easier to read.
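The one-condition-per-line layout might look like the sketch below. The frame is a made-up toy example; in particular, the "United States" rows hold placeholder values, not real Gapminder figures.

```python
import pandas as pd

# Toy data with Gapminder-like column names; all values are placeholders.
gapminder = pd.DataFrame({
    "country": ["United States", "United States", "Borduria", "Freedonia"],
    "continent": ["Americas", "Americas", "Americas", "Europe"],
    "year": [2002, 2007, 2007, 2007],
    "lifeExp": [1.0, 2.0, 3.0, 4.0],
})

# One condition per line keeps a long filter readable.
usa_2007 = gapminder[
    (gapminder["year"] == 2007)
    & (gapminder["continent"] == "Americas")
    & (gapminder["country"] == "United States")
]
print(usa_2007)
```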
Winner – dplyr. Filtering is more intuitive and easier to read.
Summary Statistics
One of the most common data analysis tasks is calculating summary statistics, such as a sample mean. This section compares Pandas and dplyr for these tasks through three problem sets.
Problem 1 – calculate the average (mean) life expectancy worldwide in 2007.
It sounds like a trivial problem – and it is. Let’s see how Pandas handles it.
Let’s do the same with dplyr:
As you can see, dplyr uses the summarize() function to calculate summary statistics, and Pandas relies on calling the function on the column(s) of interest.
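A minimal Pandas sketch of "calling the function on the column of interest", using made-up toy values chosen so the mean is easy to check by hand:

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Borduria", "Sylvania"],
    "year": [2002, 2007, 2007, 2007],
    "lifeExp": [65.0, 70.0, 60.0, 80.0],
})

# Filter to 2007, pick the column, then call mean() on it.
mean_life_2007 = gapminder[gapminder["year"] == 2007]["lifeExp"].mean()
print(mean_life_2007)  # (70 + 60 + 80) / 3 = 70.0
```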
Problem 2 – calculate the average (mean) life expectancy in 2007 for every continent.
A slightly trickier problem, but nothing you can’t handle. The solution requires a group-by operation on the column of interest. Here’s how to do the calculation with Pandas:
Let’s do the same with dplyr:
As you can see, both libraries use some sort of grouping function – groupby() with Pandas, and group_by() with dplyr – which results in similar-looking syntax.
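On the Pandas side, the grouping step could be sketched as follows, assuming a made-up toy frame with two continents:

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Sylvania", "Borduria", "Borduria"],
    "continent": ["Europe", "Europe", "Americas", "Americas"],
    "year": [2007, 2007, 2002, 2007],
    "lifeExp": [70.0, 80.0, 55.0, 60.0],
})

# Filter to 2007, group by continent, then average life expectancy per group.
by_continent = (
    gapminder[gapminder["year"] == 2007]
    .groupby("continent")["lifeExp"]
    .mean()
)
print(by_continent)
```

The result is a Series indexed by continent, so by_continent["Europe"] gives that group's mean.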
Problem 3 – calculate the total population per continent in 2007 and sort the results in descending order.
Yet another relatively simple task to do. Let’s see how to solve it with Pandas first:
Let’s do the same with dplyr:
The sorting was the only new part of this problem. Pandas uses the sort_values() function with an optional ascending argument, while dplyr uses the arrange() function.
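Here is a hedged Pandas sketch of the group, sum, and sort steps together, on a made-up toy frame:

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Sylvania", "Borduria", "Borduria"],
    "continent": ["Europe", "Europe", "Americas", "Americas"],
    "year": [2007, 2007, 2002, 2007],
    "pop": [1_000_000, 2_000_000, 4_000_000, 5_000_000],
})

# Sum population per continent for 2007, then sort largest first.
total_pop = (
    gapminder[gapminder["year"] == 2007]
    .groupby("continent")["pop"]
    .sum()
    .sort_values(ascending=False)  # ascending defaults to True
)
print(total_pop)
```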
Winner – tie. Pandas seems to be a bit more cluttered, but that’s due to the initial filtering. Calculating summary statistics in both is easy.
Creating Derived Columns
This is the last series of tasks in today’s comparison. We’ll explore how easy it is to do feature engineering in both libraries. There are only two problem sets this time.
Problem 1 – calculate the total GDP by multiplying population and GDP per capita.
This should be easy enough to do. Let’s see the Pandas implementation first:
And now let’s do the same with dplyr:
A call to the head() function in Pandas isn’t a part of the solution, but is here only to print the first couple of rows instead of the entire dataset. Implementation in both was straightforward, to say the least.
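A derived column in Pandas can be sketched like this. The column name totalGDP is a hypothetical choice for illustration, and the rows are made up.

```python
import pandas as pd

# Toy data with Gapminder-like column names; values are made up.
gapminder = pd.DataFrame({
    "country": ["Freedonia", "Borduria"],
    "pop": [1_000_000, 3_000_000],
    "gdpPercap": [10_000.0, 20_000.0],
})

# Element-wise multiplication of two columns creates the new column.
# "totalGDP" is a hypothetical name chosen for this sketch.
gapminder["totalGDP"] = gapminder["pop"] * gapminder["gdpPercap"]

# head() only limits what gets printed; it is not part of the calculation.
print(gapminder.head())
```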
Problem 2 – print the top ten countries in the 90th percentile with regard to GDP per capita.
This one is a bit trickier, but nothing you can’t handle. Let’s see how to solve it with Pandas:
And now let’s do the same with dplyr:
We’ve created an additional data frame in Pandas for convenience’s sake. Still, the implementation in dplyr is much simpler and easier to read, making R’s dplyr the winner of this section.
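One possible Pandas reading of this problem is sketched below. It assumes that "in the 90th percentile" means at or above the 0.9 quantile of GDP per capita, and that only 2007 records are considered; both are assumptions, and the 20 countries are hypothetical with made-up, evenly spaced values.

```python
import pandas as pd

# 20 hypothetical countries with made-up GDP per capita values.
gapminder = pd.DataFrame({
    "country": [f"C{i:02d}" for i in range(20)],
    "year": [2007] * 20,
    "gdpPercap": [1000.0 * (i + 1) for i in range(20)],
})

# Additional data frame holding only the 2007 records (assumed scope).
recent = gapminder[gapminder["year"] == 2007]

# Value below which roughly 90% of the GDP per capita figures fall.
threshold = recent["gdpPercap"].quantile(0.9)

# Keep rows at or above the threshold, richest first, at most ten rows.
top = (
    recent[recent["gdpPercap"] >= threshold]
    .sort_values("gdpPercap", ascending=False)
    .head(10)
)
print(top)
```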
Winner – dplyr. The syntax is much cleaner and easier to read.
Conclusion
According to the tests in this article, dplyr is a clear winner. Does that mean you should abandon Pandas once and for all? Well, no.
How well you’ll solve data analysis tasks depends largely on how familiar you are with the library. If you’re a big-time Pandas user, solving tasks with dplyr might seem unnatural, resulting in more time spent on each task. Use the library that’ll save you the most time.
If you’re equally skilled in both, there’s virtually no debate on which is “better”.
Learn More
- 7 Must-Have Skills to Get a Job as a Data Scientist
- Introduction to SQL: 5 Key Concepts Every Data Professional Must Know
- Hands-on R and dplyr – Analyzing the Gapminder Dataset
- Machine Learning with R: A Complete Guide to Linear Regression
- Machine Learning with R: A Complete Guide to Logistic Regression
Article Python’s Pandas vs. R’s dplyr – Which Is The Best Data Analysis Library comes from Appsilon | End to End Data Science Solutions.