Master Machine Learning: Multiple Linear Regression From Scratch With Python

[This article was first published on python – Better Data Science, and kindly contributed to python-bloggers].

Linear regression is one of the simplest algorithms you’ll encounter while studying machine learning. Multiple linear regression is similar to the simple linear regression covered last week – the only difference is that it has multiple slope parameters. How many? One per input feature – but more on that in a bit.

Today you’ll get your hands dirty implementing a multiple linear regression algorithm from scratch. This is the second of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more.

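Today’s article is structured as follows:

  • Introduction to Multiple Linear Regression
  • Math Behind Multiple Linear Regression
  • From-Scratch Implementation
  • Comparison with Scikit-Learn
  • Conclusion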

You can download the corresponding notebook here.

Introduction to Multiple Linear Regression

Multiple linear regression shares the same idea as its simple version – to find the best fitting line (hyperplane) given the input data. What makes it different is the ability to handle multiple input features instead of just one.

The algorithm is rather strict on the requirements. Let’s list and explain a few:

  • Linear Assumption – the model assumes the relationship between the input variables and the output is linear
  • No Noise – the model assumes that the input and output variables are not noisy, so remove outliers if possible
  • No Collinearity – highly correlated input variables make the coefficient estimates unstable, and the model becomes prone to overfitting
  • Normal Distribution – the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try applying transforms to your variables to make them more normal-looking
  • Rescaled Inputs – use scalers or normalizers to make more reliable predictions

Training a multiple linear regression model means calculating the best coefficients for the line equation. The best coefficients can be calculated through an iterative optimization process known as gradient descent.

This algorithm calculates the derivatives of the cost function with respect to each coefficient and updates the coefficients on each iteration. The size of each update depends on one parameter – the learning rate. A high learning rate can overshoot the best parameter values, and a low learning rate can lead to slow optimization.
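To get a feel for the learning rate before touching regression, here’s a minimal sketch of gradient descent on a toy function, f(w) = (w - 3)^2. The function, the `minimize` helper, and the three learning rates are illustrative assumptions, not part of the model built later:

```python
def minimize(learning_rate, n_iterations=50, start=0.0):
    """Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(n_iterations):
        gradient = 2 * (w - 3)          # derivative of (w - 3)^2
        w -= learning_rate * gradient   # step against the gradient
    return w


# A tiny rate barely moves, a moderate one converges, a large one diverges
for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w = {minimize(lr):.4f}")
```

With the same number of iterations, the smallest rate stays far from 3, the moderate one lands very close to it, and the largest one moves further away on every step.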

More on that in the next section, where we’ll discuss the math behind the algorithm.

Math Behind Multiple Linear Regression

The math behind multiple linear regression is a bit more complicated than it was for the simple one, as we won’t simply plug the values into a formula. A closed-form solution does exist, but here we’re dealing with an iterative process instead.

The equation we’re solving remains more or less the same:

Image 1 – Multiple linear regression formula (image by author)
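Written out, the formula can be sketched as follows. This is a reconstruction from the surrounding text, assuming n input features x_1 … x_n with one weight each:

```latex
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b
```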

We don’t have a single beta coefficient for the slope. Instead, we have an entire vector of them (one per input feature) – denoted as w for weights. There’s still a single intercept value – denoted as b for bias.

We’ll have to declare a cost function to continue. This is a function that measures error and represents something we want to minimize. Mean squared error (MSE) is the most common cost function for linear regression:

Image 2 – Mean squared error formula (image by author)
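As a sketch in standard notation, assuming N training samples indexed by i:

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```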

Put simply, it represents the average squared difference between actual (yi) and predicted (y hat) values. The y hat term can be expanded into the following:

Image 3 – Mean squared error formula (v2) (image by author)
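Substituting the line equation for the prediction gives the expanded form. It’s sketched here with x_i standing for the feature vector of sample i, which is a notational assumption:

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \right)^2
```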

As mentioned before, we’ll use the gradient descent algorithm to find optimal weights and bias. It relies on calculating a partial derivative for each parameter. You can find the partial derivatives of MSE with respect to each parameter below:

Image 4 – MSE partial derivatives (image by author)
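Applying the chain rule to the expanded MSE gives the following expressions. They are a reconstruction under the assumption of N samples with feature vectors x_i, so treat the exact layout as a sketch:

```latex
\frac{\partial MSE}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^{N} -2\, \mathbf{x}_i \left( y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \right)

\frac{\partial MSE}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \left( y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \right)
```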

Finally, the update process can be summarized into two formulas – one for each parameter. Put simply, the product of the learning rate and the derivative is subtracted from the old weight (or bias) value:

Image 5 – Update rules for multiple linear regression (image by author)
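In symbols, the two update rules can be sketched like this, with the arrow meaning “replace the old value with”:

```latex
\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{\partial MSE}{\partial \mathbf{w}}

b \leftarrow b - \alpha \, \frac{\partial MSE}{\partial b}
```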

The alpha parameter represents the learning rate.

The entire process is repeated for the desired number of iterations. Let’s see how this works in practice, by implementing a from-scratch solution with Python.

From-Scratch Implementation

Let’s start with the library imports. You’ll only need NumPy and Matplotlib for now. The rcParams modifications are optional and only make the visuals look a bit better:
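A minimal version of those imports might look like this. The specific rcParams values (figure size, hidden top and right borders) are assumptions chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Optional styling: a larger default figure and no top/right borders
rcParams["figure.figsize"] = (14, 7)
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
```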

 

Onto the algorithm now. Let’s declare a class called LinearRegression with the following methods:

  • __init__() – the constructor, contains the values for learning rate and the number of iterations, alongside the weights and bias (initially set to None). We’ll also create an empty list to track loss at each iteration.
  • _mean_squared_error(y, y_hat) – “private” method, used as our cost function.
  • fit(X, y) – iteratively optimizes weights and bias through gradient descent. After the calculation is done, the results are stored in the attributes declared in the constructor. We’re also keeping track of loss here.
  • predict(X) – makes the prediction using the line equation.

If you understand the math behind this simple algorithm, implementation in Python is easy. Here’s the entire code snippet for the class:
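Here’s a sketch of such a class, written to match the method list above. The default learning rate, the all-zero starting parameters, and the vectorized NumPy expressions are assumptions of this sketch:

```python
import numpy as np


class LinearRegression:
    """Multiple linear regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.01, n_iterations=10000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights, self.bias = None, None
        self.loss = []  # MSE recorded at every iteration

    @staticmethod
    def _mean_squared_error(y, y_hat):
        # Average squared difference between actual and predicted values
        return float(np.mean((y - y_hat) ** 2))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Start from all-zero parameters (an assumption of this sketch)
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss = []

        for _ in range(self.n_iterations):
            # Line equation: one prediction per sample
            y_hat = X @ self.weights + self.bias
            self.loss.append(self._mean_squared_error(y, y_hat))

            # Partial derivatives of MSE with respect to weights and bias
            partial_w = (2 / n_samples) * (X.T @ (y_hat - y))
            partial_b = (2 / n_samples) * np.sum(y_hat - y)

            # Move each parameter against its gradient
            self.weights -= self.learning_rate * partial_w
            self.bias -= self.learning_rate * partial_b

        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```

Returning `self` from `fit()` is a small convenience that lets you write `LinearRegression().fit(X, y)` in one line.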

Let’s test the algorithm next. We’ll use the Diabetes dataset from Scikit-Learn. The following code snippet loads the dataset and splits it into features and target arrays:
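Loading the data could look like this, using Scikit-Learn’s `load_diabetes()`. The variable names are just a choice for this sketch:

```python
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data.data      # feature matrix, shape (442, 10)
y = data.target    # disease progression measure, shape (442,)

print(X.shape, y.shape)
```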

The next step is to split the dataset into training and testing subsets and to train the model. You can do so with the following code snippet:
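The split and the training step can be sketched as follows. The block carries a condensed copy of the gradient descent model so it runs on its own. The `test_size`, `random_state`, and default learning rate values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split


class LinearRegression:
    """Condensed copy of the gradient descent model so this block runs alone."""

    def __init__(self, learning_rate=0.01, n_iterations=10000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights, self.bias, self.loss = None, None, []

    def fit(self, X, y):
        self.weights, self.bias = np.zeros(X.shape[1]), 0.0
        for _ in range(self.n_iterations):
            residual = X @ self.weights + self.bias - y
            self.loss.append(float(np.mean(residual ** 2)))
            self.weights -= self.learning_rate * (2 / len(y)) * (X.T @ residual)
            self.bias -= self.learning_rate * (2 / len(y)) * residual.sum()
        return self

    def predict(self, X):
        return X @ self.weights + self.bias


X, y = load_diabetes(return_X_y=True)

# test_size and random_state are illustrative choices for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

print(model.weights)  # one weight per input feature
print(model.bias)
```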

Here’s what the “optimal” weight matrix looks like:

Image 6 – Optimized weight matrix (image by author)

And here’s the optimal value for our bias term:

Image 7 – Optimized bias term (image by author)

And that’s all there is to it! You’ve successfully trained the model for 10000 iterations and landed on a hopefully good set of parameters. Let’s see how good they are by plotting the loss:

Ideally, we should see a line that starts at a high loss value and quickly drops to a much lower one before flattening out:

Image 8 – Loss per iteration (image by author)

It looks promising, but how can we know if the loss is low enough to produce a good quality model? Well, we can’t, at least not directly. The best thing we can do loss-wise is to train a couple of models with different learning rates and compare loss curves. The following code snippet does just that:
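One way to run that comparison is sketched below. The `loss_history` helper is a hypothetical name, and the candidate learning rates and split settings are assumptions. The block prints the final training loss per learning rate instead of plotting, so it doesn’t depend on Matplotlib:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split


def loss_history(X, y, learning_rate, n_iterations=10000):
    """Run gradient descent and return the MSE recorded at each iteration."""
    weights, bias, losses = np.zeros(X.shape[1]), 0.0, []
    for _ in range(n_iterations):
        residual = X @ weights + bias - y
        losses.append(float(np.mean(residual ** 2)))
        weights -= learning_rate * (2 / len(y)) * (X.T @ residual)
        bias -= learning_rate * (2 / len(y)) * residual.sum()
    return losses


X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate learning rates are an assumption of this sketch
histories = {
    lr: loss_history(X_train, y_train, lr) for lr in (0.5, 0.1, 0.01, 0.001)
}

for lr, losses in histories.items():
    print(f"learning rate {lr}: final training MSE = {losses[-1]:.2f}")
```

Each list in `histories` can be passed to `plt.plot()` to draw one loss curve per learning rate.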

The results are shown below:

Image 9 – Loss comparison for different learning rates (image by author)

It seems like a learning rate of 0.5 works the best for our data. You can retrain the model with that learning rate using the following code snippet:

Here’s the corresponding MSE value on the test set:

Image 10 – Mean squared error on the test set (image by author)

And that’s how easy it is to build, train, evaluate, and tweak a multiple linear regression model from scratch! Let’s compare it to the LinearRegression class from Scikit-Learn and see if there are any severe differences.

Comparison with Scikit-Learn

We want to know if our model is any good, so let’s compare it with something we know works well: the LinearRegression class from Scikit-Learn.

You can use the following snippet to import the model class, train the model, make predictions, and print the value of the mean squared error on the test set:
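A sketch of that comparison is shown below. It reloads and splits the data so it runs on its own. The `test_size` and `random_state` values are assumptions and should match whatever split you used for the from-scratch model:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scikit-Learn solves for the coefficients directly, with no learning rate
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

mse = mean_squared_error(y_test, lr_preds)
print(mse)
```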

Here is the corresponding MSE value:

Image 11 – Mean squared error with Scikit-Learn model (image by author)

As you can see, our tweaked model outperformed the default one from Scikit-Learn, but the difference isn’t significant. Model quality – check.

Let’s wrap things up in the next section.

Conclusion

Today you’ve learned how to implement a multiple linear regression algorithm in Python entirely from scratch. Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.

Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit-and-predict data scientist.

Thanks for reading, and please stay tuned to the blog if you’re interested in more from-scratch machine learning articles.

The post Master Machine Learning: Multiple Linear Regression From Scratch With Python appeared first on Better Data Science.
