In an earlier post I provided an overview of how to use the
openpyxl packages together to automate the production of Excel workbooks. In that post, I likened
pandas as the “Power Query of Python,” highlighting its role in cleaning and transforming tabular data.
Readers might question the need for a Power Query substitute in Python when Excel already has the genuine article. The truth, however, is more complex:
pandas does offer certain advantages and simplifies tasks that can be challenging in Power Query. The following notebook will demonstrate how
pandas performs some complex data cleaning and analysis tasks that are difficult to achieve in Power Query.
This tutorial isn’t a comprehensive guide to
openpyxl, but it aims to show you their potential benefits for your workflow. Let’s dive into it by examining the
openpyxl or not?
The screenshot below displays the resulting
contestants_cleaned.xlsx file from the previous notebook:
pandas can export a raw DataFrame like this one to an Excel workbook with some basic customization options, it doesn’t offer features like widening columns, applying conditional formatting, or inserting charts. For such tasks, you’ll need to use
openpyxl, an Excel-specific library that provides advanced functionality to manipulate Excel files. For some basics on
openpyxl, check out these blog posts.
To Python or Power Query… why not both?
This brief demo showcased data cleaning and transformation steps that are difficult or impossible to undertake in Power Query, such as working with missing values, regular expression pattern matching and more. Other areas where you may find an advantage in Power Query include dates and times, window functions and complex group-by operations.
Every analyst’s workflow is different, which means the “division of labor” everyone has between these tools is going to be different. The important part is to find your place within the spectrum: you don’t need to delete Excel once you’ve learned Python, nor does Power Query prove a superior substitute to anything in Python.
How have you combined Excel automation efforts between
openpyxl? Which tasks do you find easier to accomplish in
pandas compared to Power Query? Share your thoughts in the comments.