What the Hedgehog and the Fox says about Power Query and Python

[This article was first published on Stringfest Analytics, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Disclaimer: I realize comparing Python to another animal is a little confusing, but bear with me…

If you hadn’t noticed, I am a big fan of helping data analysts see that the marginal benefit of them learning to code exceeds the marginal cost. For heavy Excel users, I tend to focus on their learning Python because of its widespread adoption and the likelihood that it will become an officially supported scripting language for Excel.

Now, I get that learning to code is a little more involved than baking a pie, so there is some understandable skepticism from analysts about this; namely:

Why should I learn to code when there’s Power Query? Can’t that do all of the stuff Python could? That way I don’t have to learn to code and I’m using the tools Microsoft means us to use.

This is an interesting question, because there certainly are many things that Python does that Power Query can do too:

  • Sorting and filtering
  • Calculated columns
  • Joins
  • You know the drill…

So depending on your needs, maybe Power Query works just fine. Nice. That’s why Microsoft developed it!

But there are going to be use cases for Python, where Power Query doesn’t quite hold up. Maybe there’s some very obscure task you need to do, and you just can’t get Power Query to do it. Chances are, Python can — and there’s probably even a free package available for you. This could be something like:

  • Imputing missing values
  • Text pattern matching/regular expressions
  • Working with large or unusual datasets

In fact, Microsoft anticipated this scenario… this is why you can run Python (and R) scripts directly inside Power Query … in Power BI, at least. Again, if Power Query could do everything on its own…. why would this feature be here?

I could get into more specific examples of when to use Power Query or Python, but a fable better illustrates the differences. That way, rather than thinking inductively about specific use cases that separate Power Query and Python, we can think deductively about how the programs are designed with different philosophies in mind.

The Hedgehog and the Fox

The 20th century philosopher Isaiah Berlin wrote an essay-turned-book The Hedgehog and the Fox which immediately became a classic in both intellectual and popular circles. Its title is based on a fragment from the Ancient Greek poet Archilochus:

πόλλ’ οἶδ’ ἀλώπηξ, ἀλλ’ ἐχῖνος ἓν μέγα
“A fox knows many things, but a hedgehog knows one big thing.”

To make a long story short (literally), the idea here is that hedgehogs operate with a central principle in mind, keeping things as simple as possible. In contrast, foxes adapt quickly to new environments, rapidly incorporating new ideas without an overarching theory.

This CIO article does a great job breaking down the distinction further:

It may sound crazy, but this analogy is the key to unlocking the differences between Power Query and Python.

Power Query: The Hedgehog

Power Query operates with one principle in mind: extract, transform, and load. It cleans data. Specifically, it cleans tabular data.

Power Query expects that the overwhelming majority of users are going to have the same handful of tasks they need to perform. Otherwise, a graphical user interface (GUI) would never work, for the simple reason that there’s only so much screen monitor interface.

Need to work with images? Run a regression model? Power Query is not going to be much help. But if you need to consistently perform common data cleaning tasks, Power Query may indeed outperform Python.

Like the hedgehog, Power Query has a relatively limited habitat: data analysts and business users in a Microsoft shop. Because it’s focused on doing one thing well, Power Query likely feels comfortable to work with.

Python: The Fox

Now, for Python. I know that computer purists will skewer me for saying that Python doesn’t operate with a core tenet, because if that were really the case it wouldn’t get off the ground as a programming language.

But Python can feel very disjointed to the uninitiated user. There are all these packages to use and download, for different purposes, some of which may not jive too well with others. There are lots of little ideas working themselves out, rather than one big idea already worked out.

Like the fox, Python works in a variety of environments: engineering, data science, analytics and more. In many ways it’s become the lingua franca of computing because of its adaptability. Data cleaning is just one of many use cases, largely implemented by one of many “ideas,” the pandas package (another animal reference, I’m sorry!).

Now what?

Again, my goal here isn’t to make you think you have to pick Power Query or Python, one or the other. The fact that you can use Python from inside Power Query I hope kills that idea. Instead, I bring up this analogy as a helpful framework for understanding the different philosophies that each bring to the table.

Does your project fit within a very specific data cleaning scope? Is your audience mostly Excel users? Power Query is likely the better leg to lean on here, as a “one big thing” project. By contrast, something involving unusual use cases, that needs to pass through an engineering team, is a “many little things” fox project.

How do you decide which is the right tool for the job? What did you think of the analogy? Let’s talk in the comments.

If I’ve convinced you of the virtues of learning a bit of Python as an Excel or Power BI user, check out my book Advancing into Analytics.

To leave a comment for the author, please follow the link and comment on their blog: Stringfest Analytics.

Want to share your content on python-bloggers? click here.