RANSAC is an acronym for Random Sample Consensus. The algorithm fits a regression model on the subset of the data that it judges to be inliers and excludes the remaining samples as outliers. Because extreme points no longer pull on the fit, the resulting model is usually less distorted by them.
The process that is used to determine inliers and outliers is described below.
- The algorithm randomly selects a small subset of samples and treats them as inliers.
- A model is fitted to this subset, all other samples are tested against it, and those that fall within a certain tolerance are relabeled as inliers.
- The model is refitted with all of the inliers.
- The error of the fitted model versus the inliers is calculated.
- The algorithm terminates if a criterion for the number of iterations or for performance is met; otherwise it goes back to step 1.
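The steps above can be sketched in a few lines of NumPy. This is a simplified illustration for a straight line, not the scikit-learn implementation: the function name `ransac_line`, its parameters, and the synthetic data are all assumptions made for the example, and for brevity the sketch refits once on the best consensus set at the end rather than inside every iteration.

```python
import numpy as np

def ransac_line(x, y, threshold, n_trials=100, min_samples=2, seed=0):
    """Illustrative RANSAC for a straight line (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    best_mask = None
    for _ in range(n_trials):
        # Step 1: draw a small random subset and treat it as inliers
        idx = rng.choice(len(x), size=min_samples, replace=False)
        # Step 2: fit on the subset, then test every sample against that fit
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        mask = np.abs(y - (slope * x + intercept)) <= threshold
        # Keep the candidate with the largest set of inliers
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
    # Steps 3 and 4: refit on the best inlier set and compute its error
    slope, intercept = np.polyfit(x[best_mask], y[best_mask], deg=1)
    fitted = slope * x[best_mask] + intercept
    mae = np.mean(np.abs(y[best_mask] - fitted))
    return slope, intercept, best_mask, mae

# Synthetic data (made up for this sketch): y = 2x + 1 plus small noise,
# with the first five samples shifted far away from the line
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.2, size=50)
y[:5] += 20

slope, intercept, mask, mae = ransac_line(x, y, threshold=1.0)
print(slope, intercept, mask.sum(), mae)
```

Here the stopping rule is simply a fixed number of trials. A fuller implementation would also stop early once enough inliers are found.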
In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.
- Model 1 will use simple regression, with total bill as the independent variable and tip as the dependent variable.
- Model 2 will use multiple regression, with several independent variables and tip as the dependent variable.
The process we will use to complete this example is as follows
- Data preparation
- Simple Regression Model fit
- Simple regression visualization
- Multiple regression model fit
- Multiple regression visualization
Below are the packages we will need for this example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pydataset import data
from sklearn.linear_model import LinearRegression, RANSACRegressor
from sklearn.metrics import mean_absolute_error, r2_score
For the data preparation, we need to do the following
- Load the data
- Create X and y dataframes
- Convert several categorical variables to dummy variables
- Drop the original categorical variables from the X dataframe
Below is the code for these steps
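df = data('tips')
# Copy the predictors so the new columns are not written to a view of df
X = df[['total_bill', 'sex', 'size', 'smoker', 'time']].copy()
y = df['tip']
# Convert each two-level categorical variable to a 0/1 dummy column
X['male'] = pd.get_dummies(X['sex'])['Male'].astype(int)
X['smoker'] = pd.get_dummies(X['smoker'])['Yes'].astype(int)
X['dinner'] = pd.get_dummies(X['time'])['Dinner'].astype(int)
# Drop the original categorical columns (smoker was overwritten in place)
X = X.drop(columns=['sex', 'time'])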
Most of this is self-explanatory. We first load the tips dataset and divide the independent and dependent variables into X and y dataframes, respectively. Next, we convert the sex, smoker, and time variables into dummy variables, and then we drop the original categorical variables. The smoker column is overwritten in place, so only sex and time need to be dropped.
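As an aside, the same kind of encoding can be done in a single call. The sketch below uses a small hand-made frame with invented values rather than the tips data, and relies on the `columns`, `drop_first`, and `dtype` arguments of `pd.get_dummies`. Note that `drop_first=True` drops the alphabetically first level, so the time variable ends up as a lunch indicator rather than the dinner indicator used in this post.

```python
import pandas as pd

# Toy frame with invented values; the column names mirror the tips data
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0],
    'sex': ['Female', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes'],
    'time': ['Dinner', 'Dinner', 'Lunch'],
})

# One dummy column per categorical variable; the original columns are removed
encoded = pd.get_dummies(
    toy, columns=['sex', 'smoker', 'time'], drop_first=True, dtype=int
)
print(encoded)
```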
We can now move to fitting the first model that uses simple regression.
Simple Regression Model
For our model, we want to use total bill to predict tip amount. All this is done in the following steps.
- Instantiate the RANSACRegressor. We pass it a LinearRegression estimator and set residual_threshold to 2, which means that a sample more than 2 units away from the line is treated as an outlier.
- Next we fit the model
- We predict the values
- We calculate the r-square and the mean absolute error
Below is the code for all of this.
ransacReg1 = RANSACRegressor(LinearRegression(), residual_threshold=2, random_state=0)
ransacReg1.fit(X[['total_bill']], y)
prediction1 = ransacReg1.predict(X[['total_bill']])
r2_score(y, prediction1)
Out: 0.4381748268686979
mean_absolute_error(y, prediction1)
Out: 0.7552429811944833
The r-square is 44% while the MAE is about 0.76. These values are mostly useful for comparison and will be looked at again when we create the multiple regression model.
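For readers who want to see what the two metrics measure, here is a minimal sketch that computes them by hand on made-up numbers (the arrays `y_true` and `y_pred` are invented for illustration). It uses the standard definitions: MAE is the mean of the absolute residuals, and r-square is one minus the ratio of the residual sum of squares to the total sum of squares. Note that in the code above both metrics are computed on all samples, including the ones RANSAC flagged as outliers.

```python
import numpy as np

# Made-up actual and predicted values, only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true - y_pred

# Mean absolute error: average size of the residuals
mae = np.mean(np.abs(residuals))

# R-square: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```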
The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression line. It also identifies which samples are inliers and which are outliers. The code will not be explained in detail because of its complexity.
# Boolean masks for the samples RANSAC kept and the ones it excluded
inlier = ransacReg1.inlier_mask_
outlier = np.logical_not(inlier)
# Evenly spaced total_bill values for drawing the regression line
line_X = np.arange(3, 51, 2)
line_y = ransacReg1.predict(line_X[:, np.newaxis])
plt.scatter(X[['total_bill']][inlier], y[inlier], c='lightblue', marker='o', label='Inliers')
plt.scatter(X[['total_bill']][outlier], y[outlier], c='green', marker='s', label='Outliers')
plt.plot(line_X, line_y, color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')
The plot is largely self-explanatory: a handful of samples were considered outliers. We will now move on to creating our multiple regression model.
Multiple Regression Model Development
The steps for making the model are mostly the same. The real difference is in making the plot, which we will discuss in a moment. Below is the code for developing the model.
ransacReg2 = RANSACRegressor(LinearRegression(), residual_threshold=2, random_state=0)
ransacReg2.fit(X, y)
prediction2 = ransacReg2.predict(X)
r2_score(y, prediction2)
Out: 0.4298703800652126
mean_absolute_error(y, prediction2)
Out: 0.7649733201032204
Things have actually gotten slightly worse in terms of both r-square and MAE.
For the visualization, we cannot directly plot several variables at once. Therefore, we will compare the predicted values with the actual values. The more strongly the two are correlated, the better our prediction is. Below is the code for the visualization.
inlier = ransacReg2.inlier_mask_
outlier = np.logical_not(inlier)
# Reference line y = x: a perfect prediction would fall exactly on it
line_X = np.arange(1, 8, 1)
line_y = line_X
plt.scatter(prediction2[inlier], y[inlier], c='lightblue', marker='o', label='Inliers')
plt.scatter(prediction2[outlier], y[outlier], c='green', marker='s', label='Outliers')
plt.plot(line_X, line_y, color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')
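The agreement between predicted and actual values can also be summarized as a single number. The sketch below computes the Pearson correlation with `np.corrcoef` on made-up arrays (`actual` and `predicted` are invented for illustration, not taken from the tips model). A value close to 1 means the points lie near a straight line in a predicted-versus-actual plot.

```python
import numpy as np

# Made-up actual and predicted tips, only for illustration
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.0, 2.5, 4.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two arrays
corr = np.corrcoef(predicted, actual)[0, 1]
print(corr)
```

With the objects from this post, the same call would be `np.corrcoef(prediction2, y)[0, 1]`.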
The plots are mostly the same, as you can see for yourself.
This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively, which is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be used in machine learning.
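To see how much data is actually set aside, one option is to count the samples excluded by `inlier_mask_` and compare the RANSAC slope with an ordinary least squares fit. The sketch below does this on synthetic data invented for the example (a line with slope 2 and five deliberately shifted samples), so the numbers say nothing about the tips data. The names `X_demo` and `y_demo` and the threshold of 1.0 are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data (made up): y = 2x + 1 plus small noise,
# with the first five samples shifted far away from the line
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1 + rng.normal(0, 0.2, size=50)
y_demo[:5] += 20

# Ordinary least squares uses every sample
ols = LinearRegression().fit(X_demo, y_demo)

# RANSAC fits only the samples it judges to be inliers
ransac = RANSACRegressor(
    LinearRegression(), residual_threshold=1.0, random_state=0
).fit(X_demo, y_demo)

n_removed = int((~ransac.inlier_mask_).sum())
ols_slope = ols.coef_[0]
ransac_slope = ransac.estimator_.coef_[0]
print(n_removed, ols_slope, ransac_slope)
```

Checking `n_removed` before trusting the model is a cheap way to confirm that the threshold is not discarding more data than intended.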