How to understand the relationship between pandas, openpyxl and Power Query
Want to share your content on python-bloggers? click here.
In an earlier post I provided an overview of how to use the pandas
and openpyxl
packages together to automate the production of Excel workbooks. In that post, I likened pandas
as the “Power Query of Python,” highlighting its role in cleaning and transforming tabular data.
Readers might question the need for a Power Query substitute in Python when Excel already has the genuine article. The truth, however, is more complex: pandas
does offer certain advantages and simplifies tasks that can be challenging in Power Query. The following notebook will demonstrate how pandas
performs some complex data cleaning and analysis tasks that are difficult to achieve in Power Query.
This tutorial isn’t a comprehensive guide to pandas
or openpyxl
, but it aims to show you their potential benefits for your workflow. Let’s dive into it by examining the contestants.xlsx
file:
To openpyxl
or not?
The screenshot below displays the resulting contestants_cleaned.xlsx
file from the previous notebook:
While pandas
can export a raw DataFrame like this one to an Excel workbook with some basic customization options, it doesn’t offer features like widening columns, applying conditional formatting, or inserting charts. For such tasks, you’ll need to use openpyxl
, an Excel-specific library that provides advanced functionality to manipulate Excel files. For some basics on openpyxl
, check out these blog posts.
To Python or Power Query… why not both?
This brief demo showcased data cleaning and transformation steps that are difficult or impossible to undertake in Power Query, such as working with missing values, regular expression pattern matching and more. Other areas where you may find an advantage in Power Query include dates and times, window functions and complex group-by operations.
Every analyst’s workflow is different, which means the “division of labor” everyone has between these tools is going to be different. The important part is to find your place within the spectrum: you don’t need to delete Excel once you’ve learned Python, nor does Power Query prove a superior substitute to anything in Python.
How have you combined Excel automation efforts between pandas
and openpyxl
? Which tasks do you find easier to accomplish in pandas
compared to Power Query? Share your thoughts in the comments.
Want to share your content on python-bloggers? click here.