How to understand the relationship between pandas, openpyxl and Power Query

This article was first published on Stringfest Analytics , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In an earlier post I provided an overview of how to use the pandas and openpyxl packages together to automate the production of Excel workbooks. In that post, I likened pandas as the “Power Query of Python,” highlighting its role in cleaning and transforming tabular data.

Readers might question the need for a Power Query substitute in Python when Excel already has the genuine article. The truth, however, is more complex: pandas does offer certain advantages and simplifies tasks that can be challenging in Power Query. The following notebook will demonstrate how pandas performs some complex data cleaning and analysis tasks that are difficult to achieve in Power Query.

This tutorial isn’t a comprehensive guide to pandas or openpyxl, but it aims to show you their potential benefits for your workflow. Let’s dive into it by examining the contestants.xlsx file:

To openpyxl or not?

The screenshot below displays the resulting contestants_cleaned.xlsx file from the previous notebook:

Screenshot of cleaned contestants dataset

While pandas can export a raw DataFrame like this one to an Excel workbook with some basic customization options, it doesn’t offer features like widening columns, applying conditional formatting, or inserting charts. For such tasks, you’ll need to use openpyxl, an Excel-specific library that provides advanced functionality to manipulate Excel files. For some basics on openpyxl, check out these blog posts.

To Python or Power Query… why not both?

This brief demo showcased data cleaning and transformation steps that are difficult or impossible to undertake in Power Query, such as working with missing values, regular expression pattern matching and more. Other areas where you may find an advantage in Power Query include dates and times, window functions and complex group-by operations.

Every analyst’s workflow is different, which means the “division of labor” everyone has between these tools is going to be different. The important part is to find your place within the spectrum: you don’t need to delete Excel once you’ve learned Python, nor does Power Query prove a superior substitute to anything in Python.

How have you combined Excel automation efforts between pandas and openpyxl? Which tasks do you find easier to accomplish in pandas compared to Power Query? Share your thoughts in the comments.

To leave a comment for the author, please follow the link and comment on their blog: Stringfest Analytics .

Want to share your content on python-bloggers? click here.