Build a PyTorch regression MLP from scratch

This article was first published on Python – Hutsons-hacks, and kindly contributed to python-bloggers.



In this post we will go through how to build a feed-forward neural network from scratch, with the awesome PyTorch library. This library was developed by researchers at Meta (Facebook) and is widely used for tasks such as natural language processing. Here I will show you how to extend this to some of your more common tabular data tasks.

Where can I get the scripts?

The scripts are available:

Building the training script

The below steps will take you through how to build the training script end-to-end.

Let’s get going!!!

What data are we using?

The data we are going to use is medical insurance data, which was posted as a challenge on the machine learning competition platform Kaggle. This platform pits ML engineers and practitioners against each other in a contest to find the most accurate model. As this was a regression challenge, the metric chosen to optimise was the root mean squared error (RMSE).
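To make the metric concrete, here is a small worked example of RMSE in plain Python. The charges below are made up purely for illustration, and the helper name rmse is my own rather than anything from the competition.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up insurance charges, for illustration only
actual = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]

# Squared errors are 10000, 10000 and 40000, so the mean is 20000
score = rmse(actual, predicted)
print(round(score, 2))
```

Because the errors are squared before averaging, one large miss (the 200 here) weighs more heavily than several small ones, which matters for a dataset with outliers.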

To source the data for this tutorial directly, go to my GitHub data resource (https://github.com/StatsGary/Data/blob/main/insurance.csv). The data is also linked in the underlying code we are about to create.

Within the data we see 7 columns and 1338 rows, with the fields containing:

With the charges column we are going to try and estimate the charges per patient requiring insurance.

One thing to note

Here we will focus more on building the model architecture and components. One thing we won’t be doing is extensive feature engineering, outlier analysis and feature selection, as these would lead to an entirely new article.

Importing the packages we will need

The following is the list of imports we are going to need:

Under custom imports we will build our own MLPRegressor model and store this in a models folder to be used across multiple projects. I will detail the folder structure needed once we are at that juncture.

Data loading and initial setup

We will load the data in and perform a few steps, as well as setting our batch size, which we will use later as a parameter in our model training step.

Here we are:

The next step is some initial feature engineering to treat the non-continuous columns, i.e. those with multiple levels and categorical descriptions.

Feature engineering and creating tensors

Following the code snippet I will take you through each one of the lines of code in a stepwise fashion:

Strap yourselves in and here we go:
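As a rough sketch of this kind of feature engineering (not necessarily the exact steps in the original script), the snippet below encodes categorical columns as integer codes and converts everything to tensors. The tiny data frame, the column names and the split into categorical and continuous columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import torch

# Tiny made-up frame standing in for insurance.csv (values are illustrative)
df = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Assumed column roles; the original script may choose differently
cat_cols = ["sex", "smoker", "region", "children"]
cont_cols = ["age", "bmi"]
target_col = "charges"

# Convert each categorical column to the pandas category dtype,
# then take the integer code assigned to each level
for col in cat_cols:
    df[col] = df[col].astype("category")
cat_codes = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)

# Embedding layers expect int64 indices; continuous inputs are floats
cats = torch.tensor(cat_codes, dtype=torch.int64)
conts = torch.tensor(df[cont_cols].values, dtype=torch.float32)
y = torch.tensor(df[target_col].values, dtype=torch.float32).reshape(-1, 1)

print(cats.shape, conts.shape, y.shape)
```

The target is reshaped to a single column so that it lines up with the one-column output of a regression network.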

Set our embeddings

An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.

Essentially, in this context they are the values from each of the feature columns mapped as embeddings in the tensor, and function slightly differently to embeddings used in NLP models and neural nets, albeit they are still a learned continuous vector. To get the embeddings of any tabular model, you can use the following snippet:

Step by step, the following occurs:

Great – we have our embeddings for our categorical columns i.e. the number of levels each of the 4 columns goes down to. To translate this:
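As an illustration of how such sizes can be derived, the sketch below uses a common rule of thumb for tabular embeddings: the embedding width is half the number of levels, rounded up and capped at 50. Both the rule and the level counts are assumptions on my part, so the numbers may differ from those produced by the original snippet.

```python
# Hypothetical number of levels in each of four categorical columns
cat_sizes = [2, 2, 4, 6]

# (number of levels, embedding width) per column:
# width is half the number of levels, rounded up, and never more than 50
emb_sizes = [(size, min(50, (size + 1) // 2)) for size in cat_sizes]

print(emb_sizes)
```

Each tuple can be passed straight to nn.Embedding as (num_embeddings, embedding_dim), one embedding table per categorical column.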

The next step is our model definition. Once we have built this model backbone, we will save it in a separate Python file and store it in our folder structure so we can import it into all projects where the MLPRegressor is utilised.

Building our regression skeleton

I will include the whole code and then go through each step as we go along:

Define the modelling block

Let’s dive into this:

Let’s break at this point and grab a coffee! You have earned it!

Once we know the input structure we can now add to the layers by looping through each of the layers and building a sub structure underneath – this is what the code below does:

This will add:

The last step of the model definition is to set the self.layers variable to a sequential model (nn.Sequential), as we want to process each one of the steps in sequence. Next we define the forward propagation method for our model.

Define the forward passing setup

We will project the following forward through the network and perform the following steps:

In a nutshell, what we have done is build a way to handle continuous and categorical variables in our network with embeddings. We have defined the model structure, made it sequential and then indicated how the forward pass should move data through the network.
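To tie the pieces together, here is a minimal sketch of a tabular regression network with embeddings, a looped layer stack wrapped in nn.Sequential, and a forward method. The class name TabularMLP, the argument names and the choice of ReLU, batch normalisation and dropout are assumptions for illustration, so treat it as a stand-in rather than the MLPRegressor itself.

```python
import torch
import torch.nn as nn


class TabularMLP(nn.Module):
    """Hypothetical stand-in for a tabular regression MLP with embeddings."""

    def __init__(self, emb_sizes, n_cont, out_size, layer_sizes, p=0.5):
        super().__init__()
        # One embedding table per categorical column: (levels, width)
        self.embeds = nn.ModuleList([nn.Embedding(n, w) for n, w in emb_sizes])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)

        # Input width: all embedding widths plus the continuous columns
        n_in = sum(w for _, w in emb_sizes) + n_cont

        # Loop over the hidden sizes, stacking a block for each one
        layers = []
        for size in layer_sizes:
            layers += [nn.Linear(n_in, size), nn.ReLU(inplace=True),
                       nn.BatchNorm1d(size), nn.Dropout(p)]
            n_in = size
        layers.append(nn.Linear(n_in, out_size))
        self.layers = nn.Sequential(*layers)

    def forward(self, x_cat, x_cont):
        # Look up each categorical column in its own embedding table
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        x = self.emb_drop(torch.cat(embedded, dim=1))
        # Normalise the continuous inputs, then join them to the embeddings
        x = torch.cat([x, self.bn_cont(x_cont)], dim=1)
        return self.layers(x)


torch.manual_seed(0)
model = TabularMLP(emb_sizes=[(2, 1), (2, 1), (4, 2), (6, 3)],
                   n_cont=2, out_size=1, layer_sizes=[200, 100], p=0.4)

# Two made-up rows: four categorical codes and two continuous values each
x_cat = torch.tensor([[0, 1, 3, 5], [1, 0, 2, 0]])
x_cont = torch.tensor([[19.0, 27.9], [33.0, 22.7]])

model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    preds = model(x_cat, x_cont)
print(preds.shape)
```

Keeping the embeddings in an nn.ModuleList matters: a plain Python list would hide their weights from model.parameters(), so the optimiser would never update them.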

Saving our MLPRegressor structure

Now we have this structure built we will save this in a separate Python file called Regression.py and this will be nested inside a models directory. Your folder structure should look like this:

Once we have this we can import it into our two main Python projects using

from models.Regression import MLPRegressor

Use the model

The following steps will load the model in, and we will pass inputs to the model to get it set up:

The steps here are:

Splitting our data

Before we move on to setting up the training loop we are going to split the data into train and test sets (you could add a validation split here as well):
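One way to do such a split directly on tensors is sketched below. The function name split_tensors, the 20% test fraction and the shuffling with a fixed seed are my assumptions, and the original script may simply slice the tensors in order.

```python
import torch


def split_tensors(cats, conts, y, test_fraction=0.2, seed=42):
    """Shuffle row indices once and apply the same split to every tensor."""
    n = y.shape[0]
    test_size = int(n * test_fraction)
    generator = torch.Generator().manual_seed(seed)
    idx = torch.randperm(n, generator=generator)
    test_idx, train_idx = idx[:test_size], idx[test_size:]
    return (cats[train_idx], cats[test_idx],
            conts[train_idx], conts[test_idx],
            y[train_idx], y[test_idx])


# Made-up tensors with 10 rows, for illustration only
cats = torch.arange(20).reshape(10, 2)
conts = torch.arange(10, dtype=torch.float32).reshape(10, 1)
y = conts * 2.0

cat_train, cat_test, con_train, con_test, y_train, y_test = split_tensors(
    cats, conts, y)
print(y_train.shape, y_test.shape)
```

Using one shared index tensor keeps the categorical inputs, continuous inputs and targets aligned row by row after shuffling.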

Creating our training loop

This is the workhorse and something you will see in every PyTorch implementation. It defines how to train the network and perform weight updates and optimisation steps. I will take you through this step-by-step after the GitHub Gist below:

Let’s step into this:
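For orientation, a full-batch training loop of this general shape is sketched below on a toy problem. The function signature, the Adam optimiser, the RMSE loss built from nn.MSELoss and the tiny TinyTabularNet model are all assumptions of mine, so read it as an outline of the pattern rather than the original train function.

```python
import torch
import torch.nn as nn


class TinyTabularNet(nn.Module):
    """Toy model taking one categorical and one continuous column."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(2, 1)
        self.lin = nn.Linear(2, 1)

    def forward(self, x_cat, x_cont):
        return self.lin(torch.cat([self.emb(x_cat[:, 0]), x_cont], dim=1))


def train(model, y_train, cat_train, cont_train, y_val, cat_val, cont_val,
          learning_rate=0.01, epochs=100, print_out_interval=25):
    criterion = nn.MSELoss()
    optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate)
    losses = []
    for epoch in range(1, epochs + 1):
        model.train()
        y_pred = model(cat_train, cont_train)
        loss = torch.sqrt(criterion(y_pred, y_train))  # RMSE
        losses.append(loss.detach())
        optimiser.zero_grad()   # clear gradients from the previous step
        loss.backward()         # backpropagate
        optimiser.step()        # update the weights
        if epoch % print_out_interval == 0:
            print(f"epoch {epoch:4d}  train RMSE {loss.item():.4f}")
    # Evaluate once on the held-out data without tracking gradients
    model.eval()
    with torch.no_grad():
        preds = model(cat_val, cont_val)
        val_loss = torch.sqrt(criterion(preds, y_val))
    return losses, preds, val_loss


# Made-up data: the target depends linearly on both inputs plus noise
torch.manual_seed(0)
cat = torch.randint(0, 2, (64, 1))
cont = torch.randn(64, 1)
y = 3.0 * cont + cat.float() + 0.1 * torch.randn(64, 1)

model = TinyTabularNet()
losses, preds, val_loss = train(
    model, y[:48], cat[:48], cont[:48], y[48:], cat[48:], cont[48:],
    learning_rate=0.05, epochs=200, print_out_interval=50)
```

Storing loss.detach() rather than the raw loss keeps the list from holding on to each epoch's computation graph.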

Using our training loop

Using multiple assignment we will store each one of the outputs of our train function:

Here we pass in the model, y_train (the training values of the outcome variable), and the categorical and continuous training variables; the same is repeated for our validation data. We then select a learning rate, the number of epochs and the print_out_interval. When this is triggered, the script will output the below:

View our predictions vs actuals

Next, we will visualise where our model is performant and where it is way off the mark:

This stores our results in a pandas DataFrame and then visualises the data in a scatter chart:

This is a difficult dataset, as there are many outliers, and there appear to be bands of people with different levels of medical insurance charges. These bands may be indicative of the type of procedure the insurance was needed for. You could treat the outliers and repeat the training. I will leave this to you to perfect, as the aim of this post is to show how to use PyTorch to create a regression model.

Produce our model training graph

We will now see how well the model training performed:

This step uses a list comprehension to iterate through the losses list we created in our training loop, extracts every .item() (i.e. the loss value) from the list and calls the new list losses_collapsed. We use a similar comprehension to get the number of epochs and then create a pandas DataFrame.
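The comprehension step described above can be sketched as follows. The three loss values are made up, and the column names epoch and loss are my own choice rather than those in the original script.

```python
import pandas as pd
import torch

# Stand-in for the losses list built in the training loop:
# one zero-dimensional tensor per epoch (values are made up)
losses = [torch.tensor(12000.0), torch.tensor(9500.5), torch.tensor(7000.25)]

# .item() converts each zero-dimensional tensor to a plain Python float
losses_collapsed = [loss.item() for loss in losses]
epochs = [epoch + 1 for epoch in range(len(losses))]

loss_df = pd.DataFrame({"epoch": epochs, "loss": losses_collapsed})
print(loss_df)
```

Plain floats are what pandas and plotting libraries expect, which is why the tensors are collapsed before building the frame.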

We then save the data to CSV and create the seaborn (sns) chart. The chart looks as below:

This shows our model is still learning after 400 epochs, as the loss is still on the decline. We could extend the number of epochs to reduce the model loss further.

We now have our training script in place. The full code is captured here:

Building our model inference script

Now we are going to use our trained model to infer from our production data. Here we will pass through multiple examples from our production medical insurance dataset. In real life, this dataset would contain new records for which we want to estimate the medical insurance cost. Moreover, we would not know the actual cost, as we are trying to make new predictions.

Feature engineering

We will repeat the same steps as the previous example, with one slight change of loading in the production medical insurance dataset:

This next step is important. We need to specify a list of tuples with the same embedding sizes as in the previous training script. If we tried to derive these dynamically in the inference script, the production data might contain fewer levels in its categorical columns and there would be a shape mismatch. I have hard coded this, but you could import it from a text file, JSON file or similar. Again, these embedding sizes should match our training embedding sizes for the network to work correctly:
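If you would rather not hard code the sizes, one option is to write them to a JSON file from the training script and read them back at inference time. The sketch below assumes illustrative sizes and a temporary file path; neither comes from the original scripts.

```python
import json
import os
import tempfile

# Illustrative (levels, width) pairs; use the ones from your training script
emb_sizes = [(2, 1), (2, 1), (4, 2), (6, 3)]

# Training side: persist the sizes next to the model artefacts
path = os.path.join(tempfile.mkdtemp(), "emb_sizes.json")
with open(path, "w") as f:
    json.dump(emb_sizes, f)

# Inference side: JSON has no tuple type, so convert the lists back
with open(path) as f:
    loaded_emb_sizes = [tuple(pair) for pair in json.load(f)]

print(loaded_emb_sizes)
```

This keeps the two scripts in step: if the training data gains a new category level, the inference script picks up the new sizes without a manual edit.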

Right, we have everything in place, such as our categorical and continuous values encoded and converted to PyTorch tensors and our embeddings have been translated from the training script to match the shapes. Please note – for your own dataset – this would need to be updated to match the shape of your categorical and continuous values.

Load and use our saved model

In the next steps we will load our saved model, with the same parameters we used for training and then load the state_dict() from the model_artefacts folder (this could be anything you like, I just called it model artefacts):
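The general save and load pattern with state_dict() looks roughly like the sketch below. The small build_model network, the temporary directory and the file name model.pt are placeholders of mine; in the real script the architecture would be the MLPRegressor built with the training parameters.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical stand-in architecture for illustration
    return nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))


# Training side: save only the learned weights, not the whole object
trained = build_model()
artefact_path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(trained.state_dict(), artefact_path)

# Inference side: rebuild the same architecture, then load the weights
model_infer = build_model()
model_infer.load_state_dict(torch.load(artefact_path))
model_infer.eval()  # switch layers such as dropout to inference behaviour

x = torch.ones(2, 3)
with torch.no_grad():
    same = torch.equal(trained(x), model_infer(x))
print(same)
```

Because only the weights are stored, load_state_dict will raise an error if the rebuilt architecture has different layer shapes, which is exactly why the embedding sizes must match.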

The print of model_infer.eval() will print out the original saved model structure:

Here you can see the importance of our embeddings matching, otherwise the model will throw a wobbly!

Define function to process our prod data

I will explain this function in more detail underneath the code:

Let’s break this down:

Running the function we get the below print outs:

Some of these predictions are dubious, as the outliers and differences in medical insurance costs are obviously confusing the model. To make this model more useful, I would suggest creating a piecewise regression model to deal with the different levels, along with treatment of the outliers.

The full code for the inference script is here:

We have reached the end!

Wow – congratulations on getting this far. We have covered so much content in this tutorial.

Feel free to adapt the code and create a pull request to the GitHub repository if you want to add or adapt the code in any way. Remember – the aim of this tutorial was to show how you can create a regression network in PyTorch, and not to go way in depth in the billion ways you can encode features before modelling – that would be its own tutorial.

Learning PyTorch is harder than TensorFlow, as it is very Pythonic and requires you to build classes. However, once you get used to it, the tool is very powerful, and it is what I mostly use in my work with natural language processing at my company.

You have done well and keep on coding!
