Here’s how R and Python think differently about data
I’m a big believer that data analysts will derive far more value from their tools when they understand the underlying philosophy and worldview of those tools.
For example, the way that open source code is created and maintained gives it distinct advantages and disadvantages over proprietary software. And if analysts can understand the differences in scope between Power Query and Python based on an ancient fable, they’ll be more likely to make the right choice for their circumstances.
One of the most common questions I get from analysts is “Should I learn Python or R?” I don’t really have a one-size-fits-all answer: it depends on experience, anticipated use cases, personal taste and more.
What I can help show is how R and Python have two very different origin stories, and that this influences how each operates on data.
Most of my audience, and most analysts in general, come to this with Excel experience, so let’s take a look there first:
How Excel thinks about data
This isn’t rocket science, but it’s a helpful example: say you have a named range of data in Excel that you want to multiply by two.
Simple enough! Just pass the my_range reference into a cell formula, multiply it by 2, and you’ll get an output range with each number in the range times 2.
Vectorize all the things
What’s really going on here? Through the magic of named ranges, Excel operated on all of the values in our range at the same time. This operation is known as vectorization, and it has a lot going for it; namely, performance. This is one reason that storing your data in named ranges and tables in Excel leads to faster operations: it’s all done in one fell swoop.
How does this idea of vectorization play out in R and Python?
R does that too
To walk through how this works in R, take a look at the following Jupyter notebook. (To set up R to run with Jupyter on your machine, check out these instructions.)
Note: the numbered output in the above examples is not typical for R. For example, you will not see this in RStudio…
R was born and raised as a statistical programming language. That means it’s “programmed,” so to speak, to work with data in such a way: if you tell it to multiply a vector by two, it multiplies each value in that vector by two, much like you would in a math problem.
Other similarities exist between R and Excel; for example, the way each program indexes items.
Python does that… with some help
By contrast, Python was built as a general-purpose scripting language to handle error logs, communicate with operating systems, and so forth. It’s a very “computer-friendly” language, which is why you see it in so many contexts ranging from web development to AI. But with that generality, it has a hard time “reading between the lines” of how people actually want to work with data.
You’ll see in the following example that Python doesn’t exactly vectorize out of the box…
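The notebook embedded in the original post isn’t reproduced here, so what follows is only a minimal sketch of the behavior described. The list name my_list and its values are hypothetical stand-ins, not taken from the original example.

```python
# Hypothetical sample data (an assumption, not from the original notebook)
my_list = [1, 2, 3]

# Multiplying a plain Python list repeats its contents
# rather than scaling each value
repeated = my_list * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# Without a package, element-wise multiplication needs an explicit
# loop or a list comprehension
doubled = [x * 2 for x in my_list]
print(doubled)  # [2, 4, 6]
```

In other words, base Python treats a list as a general-purpose container, so the * operator means “repeat this sequence” rather than “do arithmetic on every item.”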
Of course, Python simply multiplying the entire list is pretty efficient in its own right. But with the help of a package, we can easily get what we want.
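Since the original notebook isn’t shown here, this is a rough sketch of what the package-assisted version might look like, assuming numpy is installed. The array name my_array and its values are hypothetical.

```python
import numpy as np

# Hypothetical sample data (an assumption, not from the original notebook)
my_array = np.array([1, 2, 3])

# With a numpy array, multiplication is applied to every element at once
doubled = my_array * 2
print(doubled)  # [2 4 6]
```

This is the same element-by-element result that Excel and R give by default; in Python it simply takes converting the data to a numpy array first.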
Both efficient in their own special way…
Don’t get me wrong, numpy and Python in general are easy to use and learn. But Python wasn’t necessarily designed to work with data in a way that is intuitive to users, like Excel and R were. This is not a “value judgement”; these are all great tools that you would do well to learn more about.
I know multiplying a few numbers by two may not be all that relevant to what you’re looking to do with data. That said, I find this example really cuts to the heart of how the DNA of R and that of Python are fundamentally different. R was born and built to do statistical analysis. Python wasn’t, although with the help of packages it works just fine.
What fundamental differences in R and Python (or Excel) have you observed? How do these influence your thinking on them? Let me know in the comments.