R vs Python: Which is better for Data Science?
As data science becomes applicable across more and more industry sectors, you might wonder which programming language is best for implementing your models and analysis. R and Python remain the most popular languages for data science, according to IEEE Spectrum’s latest rankings, so if you attend a data science bootcamp, Meetup, or conference, chances are you’ll run into people who use one of them, and it seems reasonable to debate which one is better. The usual advice is to use the language you are most comfortable with and that suits the needs of your organization, but for the purpose of this article we will evaluate the two head to head. We will compare R and Python in four key categories: Data Visualization, Modelling Libraries, Ease of Learning and Community Support.
Data Visualization
A significant part of data science is communication. Most of the time, you as a data scientist need to show your results to colleagues with little or no background in mathematics or statistics, so being able to illustrate those results in an impactful and intelligible manner is very important. Any language or software package for data science should have good data visualization tools.
Good data visualization involves clarity. No matter how complicated your model is, there should be a simple and unambiguous way of illustrating your results such that even a layperson would understand.
Python is renowned for its extensive number of libraries, and plenty of them can be used for plotting and visualization. The most popular are matplotlib and seaborn. matplotlib is modelled on MATLAB’s plotting interface and has similar features and styles. It is a very powerful visualization tool with all kinds of functionality built in, and it makes simple plots very easy, especially as it works well with other Python data science libraries. Although matplotlib can make a whole host of graphs and plots, what it lacks is simplicity. The most troublesome aspect is adjusting the size of the plot: if you have a lot of variables, it can get hectic trying to fit them all neatly in one plot. Another big problem is creating subplots; again, arranging them all in one figure can get complicated.
seaborn builds on top of matplotlib and offers more aesthetically pleasing graphs and plots. The library is surely an improvement on matplotlib’s archaic style, but it still has the same fundamental problem: creating figures can be very complicated. However, recent developments have tried to make things simpler.
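To make the subplot point concrete, here is a minimal sketch of a 2x2 grid in matplotlib. The curves, figure size and output file name are made up purely for illustration, and the `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no display is available
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: four simple curves, one per subplot.
x = np.linspace(0, 2 * np.pi, 200)
curves = {
    "sin(x)": np.sin(x),
    "cos(x)": np.cos(x),
    "sin(2x)": np.sin(2 * x),
    "cos(2x)": np.cos(2 * x),
}

# Create a 2x2 grid of axes that share the same x-axis.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)

# axes is a 2D array; axes.flat lets us walk it in row-major order.
for ax, (title, y) in zip(axes.flat, curves.items()):
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_ylabel("value")

# Ask matplotlib to adjust spacing so titles and labels do not overlap.
fig.tight_layout()
fig.savefig("subplots_demo.png")  # hypothetical output file name
```

Even in this small case you have to think about the grid shape, the figure size and the spacing yourself, which is the kind of bookkeeping described above.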
There are many libraries that could be used for data visualization in R, but ggplot2 is the clear winner in terms of usage and popularity. The library follows a grammar of graphics philosophy, in which plots are built up from layers. Layers can share common features such as data and aesthetic mappings, which allows one to create very sophisticated plots with very few lines of code. The library also allows plotting of summary functions. ggplot2 is more elegant than matplotlib, so I feel that in this department R has an edge.
It is, however, worth noting that Python has a ggplot library offering functionality similar to the original ggplot2 in R. With it, R and Python end up roughly on par with each other in this department.
Modelling Libraries
Data science requires the use of many algorithms, and these sophisticated mathematical methods require robust computation. You as a data scientist will rarely, if ever, need to code a whole algorithm on your own. Doing so is incredibly inefficient and sometimes very hard, so data scientists need languages with built-in modelling support. One of the biggest reasons Python and R get so much traction in the data science space is the models you can easily build with them.
As mentioned earlier, Python has a very large number of libraries, so it comes as no surprise that it has ample machine learning libraries, PyTorch to name just one. Python also has pandas, which handles tabular data and makes it very easy to manipulate CSV or Excel-based data.
In addition to this, Python has great scientific packages like numpy, with which you can do complicated mathematical calculations like matrix operations in an instant. All of these packages combined make Python a powerhouse suited for hardcore modelling.
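As a rough illustration of the pandas and numpy workflow described above, the sketch below reads a small CSV and then solves a linear system. The CSV content, column names and matrix values are invented for the example; in practice `read_csv` would usually be pointed at a file path.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content, kept inline so the example is self-contained.
csv_text = """city,sales,cost
Austin,120,80
Boston,95,70
Denver,130,90
"""

# pandas loads the CSV into a DataFrame (a table with named columns).
df = pd.read_csv(io.StringIO(csv_text))

# Column arithmetic is vectorised: no explicit loop over the rows.
df["profit"] = df["sales"] - df["cost"]
total_profit = int(df["profit"].sum())

# numpy handles matrix operations: solve the linear system A x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

print(df)
print("total profit:", total_profit)
print("solution of A x = b:", x)
```

The same DataFrame could then be handed to a modelling library, which is a large part of why these packages work well together.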
R was developed by statisticians and scientists to perform statistical analysis long before that was such a hot topic. As one would expect from a language made by scientists, one can build a plethora of models using R. Just like Python, R has plenty of libraries, approximately 10,000 of them. Packages such as caret are among the most widely used. These packages will have your back from the pre-modelling phase to the post-model/optimization phase.
Since you can use these libraries to solve almost any sort of problem, for this discussion let’s just look at what you can’t model. Python is comparatively weak in statistical non-linear regression (beyond simple curve fitting) and mixed-effects models. Some would argue that these are not major barriers or can simply be circumvented. True! But when the competition is stiff, you have to be nitpicky in order to decide which is better. R, on the other hand, lacks the speed that Python provides, which matters when you have large amounts of data (big data).
Ease of Learning
It’s no secret that data scientist is currently one of the most in-demand jobs, if not the most in-demand. As a consequence, many people are looking to jump on the data science bandwagon, and many of them have little or no programming experience. Learning a new language can be challenging, especially if it is your first. For this reason, it is appropriate to include ease of learning as a metric when comparing the two languages.
Python was conceived in the late 1980s with a philosophy that emphasizes code readability and a vision of making programming simple. Its designers clearly succeeded, as the language is fairly easy to learn. Although Python takes inspiration for its syntax from C, unlike C it is uncomplicated. It is my recommended language for beginners, since anyone can pick it up in relatively little time.
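To give a feel for that readability, here is a small sketch using only the standard library. The names and scores are made up for illustration.

```python
from statistics import mean

# Hypothetical test scores for two students.
scores = {
    "alice": [82, 91, 77],
    "bob": [68, 74, 80],
}

# Build a new dictionary mapping each student to their average score.
averages = {name: mean(values) for name, values in scores.items()}

# Pick the student whose average is highest.
best = max(averages, key=averages.get)

print(averages)
print("highest average:", best)
```

Even without prior Python experience, most readers can guess what each line does, which is a large part of the language’s appeal to beginners.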
In this category Python is the clear winner. However, it must be noted that programming languages in general are not hard to learn. If a beginner wanted to learn R, it would not be as easy, in my opinion, as learning Python, but it would not be an impossible task either.
Community Support
Every so often, as a data scientist you are required to solve problems that you haven’t encountered before. Sometimes you may have difficulty finding the relevant library or package for the job. To find a solution, people commonly search the language’s official documentation or online community forums. Good community support helps programmers in general to work more efficiently.
Both of these languages have active Stack Overflow communities and active mailing lists (where one can easily ask experts for solutions). R has online R-documentation where you can find information about specific functions and their inputs. Most Python libraries, like scikit-learn, have their own official online documentation that explains the library.
Both languages have large user bases and, hence, very active support communities. It isn’t difficult to see that they are equal in this regard.
R has been used for statistical computing for over two decades now. You can get started with writing useful code in no time. It has been used extensively by data scientists and has a huge number of packages available for a wide range of data science tasks. I have almost always been able to find a package in R to get the task done very quickly. I have decent Python skills and have written production code in Python. Even so, I find R slightly better for quickly testing out ideas, trying out different ways to visualize data and rapid prototyping.
Python has advantages over R in certain situations. It is a general-purpose programming language, and it has libraries like pandas, numpy, scipy and scikit-learn, to name a few, which come in handy for data science work.
If you get to the point where you have to showcase your data science work, Python would be a clear winner. Python has Django, an awesome web application framework that can help you create a web service or site with both your data science and web programming done in the same language.
You may hear speed and efficiency arguments from both camps; ignore them for now. If you get to a point where you are doing something substantial enough that the speed of your code matters, you will probably figure things out on your own. So don’t worry about it at this point.
Considering that you are a beginner in both data science and programming, and that you have a background in Economics and Statistics, I would lean towards R. Besides being very powerful, Python is without a doubt one of the most beginner-friendly programming languages, but it is still a programming language. Your learning curve may be a bit steeper in Python than in R.
You should definitely learn Python once you are comfortable with R and have grasped the general concepts of data science, which will take some time. You can read “What are the key skills of a data scientist?” to get an idea of the skill set you will need to become a data scientist.
Start with R, transition to Python gradually and then start using both as needed. Both are great for data science, but each is better than the other in certain situations.