What data analysts should know about open source

Posted on December 31, 2021 by George Mount in Data science | 0 Comments

This article was first published on Stringfest Analytics, and kindly contributed to python-bloggers.

For a long time, many data analysts worked firmly ensconced in the Microsoft stack: definitely Excel, maybe some Access or PowerPoint, with VBA serving as the automation glue. If it wasn’t Microsoft, it was some other proprietary toolkit from IBM or Oracle, paid off-the-shelf software, and so forth.

These days? We have the godfather of Excel, Bill Jelen, making a tutorial on how to work in Excel with… JavaScript. JavaScript!

We also have the very real possibility that Python will be officially supported for Excel. Thousands have asked, and Microsoft is (probably) answering.

Why should you care?

As it turns out, both JavaScript and Python are open source tools, along with R, Spark, Docker and so many other tools powering the data boom. So we have data analysts using open source tools, many of them for the first time and often in conjunction with proprietary products such as Excel, Power BI, etc.

Why does this matter?

I believe that to be an educated user of data, you should have some idea of where your tools came from. As a gourmand, you probably want to know how your food was sourced, right? It helps you contextualize how the pieces of your craft come together.

Without some knowledge of provenance, you’re just a consumer. Learn a bit about the origin and philosophy behind the tools you use, and you’ll be that much savvier about how to use them right.

So here are some tidbits data analysts would benefit from learning about open source: some good, some questionable, all helpful in improving as an analyst.

It’s a different philosophy of ideas

First and foremost, open source software reflects an entirely different outlook than proprietary software on the economics and culture of information. As you might expect, there’s a lot of hair-splitting about what open source really means. One common distinction is what’s meant by calling open source “free.” Sure, it usually doesn’t cost anything, but software can be proprietary and still gratis. That’s why you may hear open source described as “free as in speech” rather than “free as in beer”: in other words, you are given protected rights to use, study, modify and share the software however you see fit, rather than just consume it passively.

Enough semantics; let’s illustrate the primary difference between proprietary and open source with an example:

Imagine you stood on a street corner hawking copies of Excel. You figured that would be fine since you are throwing in a few of your own add-ins, right? How do you think Microsoft would feel about this venture of yours? They’ve got intellectual property protection against that stuff.

By contrast, open source licenses grant you broad rights to use, modify and redistribute the software, sometimes with conditions such as keeping the license notice intact. You’d be perfectly welcome to add a few goodies to a download of Python and sell that (and you’ll see that’s exactly what one of the companies we’ll look at in a bit does).

If you’d like to learn more about what open source means, check out this article from Red Hat — one of the pioneers of the practice, by the way.

There’s a thriving aftermarket… paid and free

Red Hat makes a fascinating open source case study, because while it relies on “free” open source software, it grew into a publicly traded company that IBM acquired in 2019. If it sounds crazy that one can build a business on open source… it’s actually become quite common!

For example, in Advancing into Analytics we take advantage of two free downloads from companies built atop open source: RStudio Desktop to work with R, and the Anaconda distribution of Python.

That’s the wild thing about open source: because the intellectual property is so loose, there’s plenty of opportunity for people to build commercial enterprises right atop it in ways that could get hairy with proprietary software.

Volunteer community

Another major difference (and this one is important for data analysts to keep in mind) is that historically, open source projects were maintained largely by volunteers. There’s not necessarily an official support channel or ticket you can file with these products.

Now, fortunately the market is providing for some solutions here — for example, many of the aforementioned “business of open source” companies will provide paid support licenses. There’s also been more visibility into the support and long-term planning of open source projects.

But it’s still something data analysts should perhaps consider before using open source on a project: things may be more fragile, and if something goes wrong they may need to be a bit more self-reliant in troubleshooting.

The do-it-yourself ethos runs high in open source in general, but that also brings about a large and vocal community. Open source users, developers and advocates can be opinionated and maybe a bit bossy to outsiders, but perhaps that’s because they are quite personally invested in these projects. We’ll talk more about the organization of open source shortly. But first, some more history.

The bad blood is over

Perhaps one of the most significant reasons data analysts hadn’t been using open source tools until recently is that Microsoft didn’t want them to! Nearly as soon as he’d co-founded Microsoft, Bill Gates wrote a famous “open letter to hobbyists” denouncing the rampant unauthorized distribution of software.

Perhaps Gates had a point, as this distribution more or less amounted to piracy. But what about software that was designed to be redistributed for free? Microsoft didn’t like that either, with then-CEO Steve Ballmer in 2001 describing the open source operating system Linux as “a cancer that attaches itself in an intellectual property sense to everything it touches.”

Flash forward a few years and Microsoft admitted it was wrong on open source; among other actions, it’s released a full Linux kernel for Windows and has open sourced many of its own products like Visual Studio Code and PowerShell.

In fact, it’s even been said on more than one occasion that Microsoft ♥ Linux…

Microsoft admitted there are many things admirable about open source, and that many open source projects offer features that complement their own quite nicely.

For example, Power BI supports R and Python for data visualization, data cleaning, machine learning and more. In fact, I have a course on R for Power BI users for you to check out if you’re interested:

Upcoming Course: Basics Of R For Power BI Users (Part 1)

And, as mentioned earlier, there are now ways to work with JavaScript and Excel Online. As I like to say, Microsoft wouldn’t be spending millions of dollars to make these tools available if there weren’t any benefit to using them. Admitting the benefits of these open source tools is a win for everyone.

It moves quickly

Have you noticed that tech moves kind of quickly these days? I’m really not that old, but as a college student I didn’t have a smartphone or tablet or GPS. Could you imagine a student these days operating without those?

As technology rapidly changes, so does the code needed to power it. Code that is open sourced for anyone to develop and tweak can move much faster — for better or worse. Yes, this can be a bout of “move fast and break things.” This goes back to the idea of proprietary software being arguably more stable.

In many cases, there may be a cutting-edge data science approach that just hasn’t been fit into a proprietary tool yet. Open source does let software move at the speed of research. I would say this has been a critical factor in raising the status of data science in modern business.

This CNBC video does a great job exploring just why open source has become so dominant today:

One theme touched on by the video is the role of GitHub in facilitating open source development. This is a theme echoed by a fantastic book on open source, Working in Public by Nadia Eghbal. GitHub in many ways has become the common denominator for distributed contributors to work together on open source, such that learning a bit of GitHub is a de facto rite of passage for any programmer today.

You’ll learn even more about open source from this book

And hey, would you look at that: Microsoft now owns GitHub. Again, it’s almost like Microsoft wants its customers to be able to use open source these days…

Get a bigger perspective on tech

Chances are very high that something your organization uses is open source. In fact, Red Hat found that 90% of IT respondents reported using enterprise open source.

If you work with any data scientists or engineers, they’re almost certainly using open source. By understanding through experience how it works, its pros and cons, you’ll be in a much better place to collaborate effectively with them.

As you’re seeing, open source has been a crucial element of today’s tech boom, so much so that I would argue that any serious technical professional (and I’d consider data analyst a candidate here!) should have at least some exposure to it!

Sounds interesting… now what?

Glad I piqued your interest! Here’s what I suggest:

Consider some of the types of tools you use regularly on the job: spreadsheets, databases, ETL tools, etc. Do some product research. Which are open source and which are proprietary? What ramifications are there?
Most open source analytics tools are code-based. Learn a bit! Understand enough to know what a distribution is, what packages are, etc. See what kinds of projects are being built in that language on GitHub.
Take a look at the proprietary software you’re using. Are there opportunities to integrate open source tools with it? Many vendors have followed Microsoft in encouraging open source work.
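
If you already have Python installed (through Anaconda or otherwise), here is one small way to start poking around. This is only a sketch: the function name installed_packages is my own invention for illustration, and the license text shown is whatever each package author chose to declare, so treat it as a starting point for your research rather than legal advice. It uses the standard library's importlib.metadata module to list the packages in your environment along with their declared licenses.

```python
from importlib import metadata


def installed_packages():
    """Return sorted (name, version, license) tuples for installed packages.

    Hypothetical helper for illustration; the license is self-declared
    by each package and may be missing or abbreviated.
    """
    rows = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name")
        if not name:
            # Skip entries with broken or incomplete metadata.
            continue
        version = meta.get("Version") or "unknown"
        # Newer packages may declare License-Expression; older ones use License.
        declared = meta.get("License-Expression") or meta.get("License") or "not declared"
        # Some packages paste the whole license text; keep the first line only.
        first_line = declared.splitlines()[0][:40] if declared.splitlines() else "not declared"
        rows.append((name, version, first_line))
    # Deduplicate and sort case-insensitively by package name.
    return sorted(set(rows), key=lambda row: row[0].lower())


if __name__ == "__main__":
    for name, version, license_text in installed_packages()[:10]:
        print(f"{name:<25} {version:<12} {license_text}")
```

Run it and skim the output: every row is a package someone published for you to use, and the last column hints at the terms they published it under.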

Whatever software you use for analytics, it didn’t just fall from the sky. It was developed by people, under a contract, for good and bad. Given how much impact it’s made on the world, it’s worth data analysts learning a bit about the wild world of open source and even experiencing it for themselves. Chances are good that they can (or must) use it for a future project. If Mr. Excel’s doing it… you probably should too.