Python and R for data analytics: A tutorial with examples for aspiring data scientists

This article was first published on Technical Posts Archives - The Data Scientist, and kindly contributed to python-bloggers.

Data is being generated at a rapid rate. From large-scale to small-scale businesses and individuals, everyone is generating data. To make the most of it, the data needs to be analyzed. This is where “data analytics” comes in. Data analytics takes raw data and draws conclusions from it, which allows stakeholders to make data-driven decisions.

Taking actions based on the analyses conducted on data

Data analytics belongs to the same family as data science, but data science involves a lot more programming.

Data analytics is the process of examining data and extracting meaningful information from it. It is a way of quantifying and analyzing the data that is collected.

Data analytics can be used to predict future trends, make decisions, and solve problems in many different industries, including healthcare, finance, retail, and education.

The goal of data analytics is to extract insights from data that will help organizations make better decisions in the future.

The process of data analysis includes four steps: collecting the data, preparing the data for analysis, analyzing the data to find insights or patterns, and communicating those insights or patterns back to stakeholders in an understandable way.
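The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`month`, `sales`) and the variable names are invented for the example, and only the standard library is used.

```python
from statistics import mean

# 1. Collect: in practice this would come from a file, database or API.
#    These records are invented for illustration.
raw_records = [
    {"month": "Jan", "sales": "120"},
    {"month": "Feb", "sales": "135"},
    {"month": "Mar", "sales": None},   # missing value
    {"month": "Apr", "sales": "150"},
]

# 2. Prepare: drop incomplete rows and convert text to numbers.
clean = [
    {"month": r["month"], "sales": float(r["sales"])}
    for r in raw_records
    if r["sales"] is not None
]

# 3. Analyze: compute a simple summary and find the best month.
average_sales = mean(r["sales"] for r in clean)
best = max(clean, key=lambda r: r["sales"])

# 4. Communicate: report the finding in plain language.
report = (
    f"Average monthly sales: {average_sales:.1f}. "
    f"Best month: {best['month']} ({best['sales']:.0f})."
)
print(report)
```

Real projects usually spend most of their effort on steps 1 and 2; the analysis itself is often the shortest part.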

Types of Data Analytics

The following are the four main types of data analytics:

Descriptive Analytics

The main goal of descriptive analytics is to answer the question “what”.

Descriptive analytics summarizes the data and tries to answer the question of what happened in the past.
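As a rough sketch of what "summarizing the data" can mean in practice, the snippet below computes a few common summary statistics. The price values are invented for illustration and only the standard library is used.

```python
from statistics import mean, median, stdev

# Hypothetical monthly stock index prices (invented for illustration).
prices = [1464, 1394, 1357, 1293, 1256, 1254]

summary = {
    "count": len(prices),
    "min": min(prices),
    "max": max(prices),
    "mean": mean(prices),
    "median": median(prices),
    "stdev": stdev(prices),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {round(value, 1)}")
```

These numbers describe what happened; they do not say why it happened or what will happen next.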

Diagnostic Analytics

Diagnostic analytics digs deeper into descriptive analytics and tries to explain why past events happened.

Instead of focusing on what happened in the past, diagnostic analytics tries to answer why an anomaly or other event occurred in your data.
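One common diagnostic move is to break an aggregate change down into its parts. The sketch below assumes made-up regional sales figures and hypothetical region names; it shows the idea rather than a complete diagnostic method.

```python
# Hypothetical sales by region for two consecutive months (invented numbers).
previous = {"North": 100, "South": 90, "East": 110}
current = {"North": 102, "South": 55, "East": 108}

# Descriptive view: what happened? Total sales fell.
total_change = sum(current.values()) - sum(previous.values())

# Diagnostic view: why? Break the change down by region.
change_by_region = {
    region: current[region] - previous[region] for region in previous
}
main_driver = min(change_by_region, key=change_by_region.get)

print(f"Total change: {total_change}")
print(f"Change by region: {change_by_region}")
print(f"Largest drop: {main_driver}")
```

Here the overall drop is explained almost entirely by one region, which tells the analyst where to look next.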

Predictive Analytics

In predictive analytics, a data analyst studies historical data and tries to predict future events.

It is a form of advanced analytics and mostly falls under data science.

Here we take historical data and use it to make forecasts, such as when parts will break down or what future sales will be.
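A very simple forecast, shown here only as a sketch, is to predict the next value as the average of the most recent ones. The sales history and the function name `moving_average_forecast` are assumptions made for this example; real predictive models are usually more sophisticated.

```python
from statistics import mean

# Hypothetical monthly sales history (invented for illustration).
sales_history = [200, 210, 205, 220, 230, 225]


def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return mean(history[-window:])


next_month = moving_average_forecast(sales_history)
print(f"Forecast for next month: {next_month:.1f}")
```

The linear regression used later in this tutorial follows the same pattern: learn from past observations, then apply what was learned to new inputs.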

Prescriptive Analytics

Prescriptive analytics guides you toward the best course of action. It merges predictive and descriptive analytics to make data-driven decisions, pointing you to a specific action to take.
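To make the idea concrete, the sketch below takes a forecast as given and picks the action with the best expected outcome. The demand figure, prices, stock options and the `expected_profit` helper are all hypothetical.

```python
# Hypothetical forecast and options (all numbers invented for illustration).
predicted_demand = 225          # units, e.g. from a predictive model
unit_price = 10.0
unit_cost = 6.0

# Candidate actions: how many units to stock.
stock_options = [150, 200, 250, 300]


def expected_profit(stock):
    """Profit if we stock `stock` units and demand equals the forecast."""
    units_sold = min(stock, predicted_demand)
    return units_sold * unit_price - stock * unit_cost


best_stock = max(stock_options, key=expected_profit)
print(f"Recommended stock level: {best_stock}")
```

The prediction says what is likely to happen; the prescriptive step turns that into a recommendation by comparing the consequences of each option.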

Tools for Data Analytics: Python and R

Python comes with numerous packages and tools which can be used in data analytics. Python can be used for deep learning, making predictions, and making visualizations.

R is one of the most widely used tools for data analytics. R has a number of tools that can be used for statistical analysis.

Linear regression

Now let’s see a very simple example in both Python and R. We will use a linear regression model, which is one of the simplest regression models.

In Python, we’ll use the ordinary least squares (OLS) model from the statsmodels library. In R, we’ll use the built-in lm() function, so no external library is needed for the model itself.

Below is a code demo of linear regression in Python and R, along with the results.

For this purpose, we’ll be loading a fictitious dataset. Here’s our dataset named:

our fictitious dataset

Python example

Here’s the code in Python, starting with loading the data.

First, we import the necessary libraries.

import json

from pandas import DataFrame

import statsmodels.api as sm

import numpy as np

Now we will load the data, which is in JSON format. Opening the file in a with block closes it automatically once the data has been read.

with open('data.json') as file:
    data = json.load(file)

Then we will convert our JSON data into a DataFrame with the columns listed in the order we want.

df = DataFrame(data, columns=['year', 'month', 'interest_rate', 'unemployment_rate', 'stock_index_price'])
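The contents of data.json are not shown here. One layout that is consistent with both the Python and the R code in this tutorial is a single JSON object with one key per column, in the order used above. The sketch below assumes that layout; the numbers are invented and the name `sample_json` is only for illustration.

```python
import json

# Assumed layout of data.json: one key per column, each holding a list of
# values of equal length. The numbers are invented for illustration.
sample_json = """
{
  "year": [2017, 2017, 2017],
  "month": [12, 11, 10],
  "interest_rate": [2.75, 2.5, 2.5],
  "unemployment_rate": [5.3, 5.3, 5.3],
  "stock_index_price": [1464, 1394, 1357]
}
"""

data = json.loads(sample_json)

# Every column should have the same number of rows.
row_counts = {name: len(values) for name, values in data.items()}
assert len(set(row_counts.values())) == 1, "columns differ in length"

print(list(data))             # column names, in file order
print(data["interest_rate"])  # one column as a plain Python list
```

With this layout, the third key is the interest rate and the fifth is the stock index price, which matches the column positions used in the R example.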

Let X be the independent variable, and let Y be the dependent variable.

X = df['interest_rate']
Y = df['stock_index_price']

The add_constant function adds a column of ones to X so that the model can estimate the intercept b (y = mx + b).

X = sm.add_constant(X)
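The effect of adding a constant can be pictured without any library. In the sketch below, each row gets a leading 1.0, so the intercept is simply the coefficient of that column. The input values and the coefficients `b` and `m` are hypothetical, and `predict_row` is a helper written for this example, not part of statsmodels.

```python
# Plain-Python picture of what adding a constant does: every row of X gets
# a leading 1.0, so the intercept b is just the coefficient of that column.
interest_rate = [2.75, 2.5, 2.25]          # invented values

X_with_const = [[1.0, x] for x in interest_rate]


def predict_row(row, coefficients):
    """Dot product of one design-matrix row with [b, m]."""
    return sum(value * coef for value, coef in zip(row, coefficients))


b, m = 100.0, 500.0                          # hypothetical intercept and slope
predictions = [predict_row(row, [b, m]) for row in X_with_const]

print(X_with_const)
print(predictions)
```

Without the column of ones, the fitted line would be forced through the origin, which is rarely what we want.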

Now fit the model on the data and make predictions.

model = sm.OLS(Y, X).fit()

predictions = model.predict(X)
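For a single predictor, the least-squares fit has a closed-form solution, which can help demystify what fitting the model does. The sketch below computes the slope and intercept by hand on invented data points; it illustrates the underlying formula and is not how statsmodels is implemented internally.

```python
from statistics import mean

# Invented data points for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Least-squares estimates for y = m*x + b.
m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b = y_bar - m * x_bar

# Predictions for the same x values the model was fitted on.
fitted = [m * xi + b for xi in x]

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

The slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so that the line passes through the point of means.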

Now print the model summary, which you can see below.

print_model = model.summary()

print(print_model)

 

After we run this, we get back the model results. The statsmodels results provide everything we expect from a statistical package. While many of the numbers here might seem a bit esoteric to fledgling data scientists, the main ones to focus on are:

  1. R-squared: A general estimation of the quality of the fit. It takes values between 0 and 1 (the higher the better).
  2. Prob (F-statistic): This is an omnibus significance test for our model. A p-value of less than 0.05 is usually taken as the model being significant.
  3. The coefficient p-values, and the direction of the effect.
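To give a feel for the first of these numbers, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented observations and predictions, and `r_squared` is a helper written for this example.

```python
from statistics import mean


def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented observations and model predictions for illustration.
actual = [2.0, 4.0, 6.0, 8.0]
good_fit = [2.1, 3.9, 6.1, 7.9]
poor_fit = [5.0, 5.0, 5.0, 5.0]   # always predicts the mean

print(f"good fit: {r_squared(actual, good_fit):.3f}")
print(f"poor fit: {r_squared(actual, poor_fit):.3f}")
```

A model that always predicts the mean explains none of the variation and scores 0, while predictions close to the observations push the value toward 1.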

statsmodels linear regression results

Now, let’s do a regression plot of the fit of the model. The fit is of average quality.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 6))

fig = sm.graphics.plot_partregress_grid(model, fig=fig)

plt.show()

Linear regression plots python

R example

And now here is the code in R.

First, we import the libraries. Of these, only rjson is used in the code below.

library("rjson")

library("tidyverse")

library("dplyr")

Now we will load the data, which is in JSON format.

result <- fromJSON(file = "data.json")

Let X be the independent variable, and let Y be the dependent variable. We will load the third column of our data which is the interest rate into X, and the 5th column into Y which is the stock index price.

X <- result[c(3)]
Y <- result[c(5)]

The lm() function fits a linear model to the relationship between Y and X, and the unlist() function converts lists to vectors.

relation <- lm( unlist(Y)~ unlist(X))

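Now make a data frame from the independent variable and use the fitted model to predict the stock index price. We store the predictions under a new name so that the loaded data in result is not overwritten.

# The column is named X so that predict() can evaluate unlist(X) on the new data
a <- data.frame(X = unlist(X))

predictions <- predict(relation, a)

print(predictions)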

Let’s make a regression plot.

plot(unlist(X), unlist(Y), col = "blue", main = "Interest Rate & Stock Index Price", xlab = "Interest Rate", ylab = "Stock Price Index")

# Draw the fitted regression line on top of the scatter plot
abline(relation)

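Here are the model results, as printed by summary(relation):

We get results similar to our Python model above, and we’ll again be looking at R-squared, Prob (F-statistic), and the coefficient p-values. In R’s output, these appear as Multiple R-squared, the p-value at the end of the F-statistic line, and the Pr(>|t|) column.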

  1. R-squared gives a general estimation of the quality of the fit and has a value between 0 and 1 (the higher the better).
  2. Prob (F-statistic) is a significance test for our model. If the p-value is less than 0.05, then it is usually taken as the model being significant.
  3. The coefficient p-values, and the direction of the effect.

model summary linear regression

 

Here is the final regression plot.

Summary

In our article, we discussed some of the basic concepts of data analytics that would help any aspiring data scientist who is starting their journey in this domain. We discussed different types of analyses and the different types of tools being used. Then we saw how Python and R can be used to conduct data analysis.

If you’re an aspiring data scientist trying to enter the domain of data analytics or a skilled professional trying to acquire new skills, you can reach out to us or continue reading our other blogs. Also, make sure to check Beyond Machine which helps you become a data scientist and get a job in the field, no matter your prior knowledge.

The post Python and R for data analytics: A tutorial with examples for aspiring data scientists appeared first on The Data Scientist.
