Build a PyTorch regression MLP from scratch
In this post we will go through how to build a feed-forward neural network from scratch with the awesome PyTorch library. PyTorch was developed by researchers at Meta (Facebook) and is heavily used there for natural language processing; here I will show you how to apply it to some of your more common tabular data tasks.
Where can I get the scripts?
The scripts are available:
- Full repository with requirements.txt file to replicate set up in your virtual environment: https://github.com/StatsGary/PyTorch_Tutorials/tree/main/03_MLP_Regression
- Training script: https://github.com/StatsGary/PyTorch_Tutorials/blob/main/03_MLP_Regression/mlp_train.py
- Inference script: https://github.com/StatsGary/PyTorch_Tutorials/blob/main/03_MLP_Regression/mlp_infer.py
Building the training script
The steps below will take you through building the training script end-to-end.
Let’s get going!!!
What data are we using?
The data we are going to use is medical insurance data, which was posted as a challenge on Kaggle, a machine learning competition platform that pits ML engineers and practitioners against each other to find the most accurate model. As this was a regression challenge, the metric to optimise was the root mean squared error (RMSE).
To source the data for this tutorial directly, go to my GitHub data resource (https://github.com/StatsGary/Data/blob/main/insurance.csv); the same link is referenced in the code we are about to create.
Within the data we see 7 columns and 1338 rows, with the fields containing:
- Age of patient seeking medical insurance (continuous)
- Sex of patient (categorical)
- Body Mass Index (BMI) of patient (continuous)
- Number of children the patient has (discrete)
- Smoker (categorical)
- Region of patient (categorical)
- Charges (continuous) – our outcome of interest
Using the other columns, we will try to estimate the charges for each patient seeking insurance.
One thing to note
Here we will focus more on building the model architecture and components. One thing we won't be doing is extensive feature engineering, outlier analysis or feature selection, as these would warrant an entirely new article.
Importing the packages we will need
The following are the list of imports we are going to need:
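A minimal sketch of the imports the steps below rely on (numpy, pandas, matplotlib and seaborn are assumed for the data handling and charts later in the post):

```python
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn

# Custom imports - our own MLPRegressor, stored in the models folder
from models.Regression import MLPRegressor
```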
Under custom imports we will build our own MLPRegressor model and store this in a models folder to be used across multiple projects. I will detail the folder structure needed once we are at that juncture.
Data loading and initial setup
We will load the data in and perform a few steps, as well as setting our batch size, which we will use later as a parameter in our model training step.
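A sketch of this step (the variable names are assumptions consistent with the walkthrough below):

```python
# Capture the current project name - reused when saving artifacts and models
data_name = 'insurance'

# Load the raw GitHub user content for the insurance file
df = pd.read_csv('https://raw.githubusercontent.com/StatsGary/Data/main/insurance.csv')

# Drop any null values
df = df.dropna()

# Number of rows in the dataset
obs = len(df)

# Batch size set to half the number of observations
batch_size = obs // 2
```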
Here we are:
- setting a data_name variable to capture the current project – we will use this later on when saving our artifacts and models
- creating a variable called df to store the raw GitHub user content for the insurance file
- dropping any null values – there are many more sophisticated ways to treat missing data, one of which is MICE (multiple imputation by chained equations), but as I said, we are focusing on building the model rather than the hundred other steps to get the data into better shape
- getting the number of rows in the dataset
- creating a batch size that is half the number of observations – for better results in model training you could tune this as a hyperparameter
The next step is some initial feature engineering to treat the non-continuous columns, i.e. those with multiple levels and categorical descriptions.
Feature engineering and creating tensors
Following the code snippet, I will take you through each line in a stepwise fashion:
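A sketch of that snippet, consistent with the steps described below:

```python
# Categorical and continuous column names - adapt these for your own use case
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
y = ['charges']

# Set each categorical column to the category dtype
for col in cat_cols:
    df[col] = df[col].astype('category')

# Stack the category codes into a numpy array, then convert to a tensor
cats = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)
cats = torch.tensor(cats, dtype=torch.int64)

# Repeat for the continuous columns
conts = np.stack([df[col].values for col in cont_cols], axis=1)
conts = torch.tensor(conts, dtype=torch.float)

# Make sure the outcome is also a tensor, cast to torch.float
y = torch.tensor(df[y].values, dtype=torch.float).reshape(-1, 1)
```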
Strap yourselves in and here we go:
- cat_cols is a list of the categorical column names: sex, smoker, region and children
- cont_cols is a list of the continuous column names: age and bmi
- At this point it goes without saying that you would need to adapt these lists for your own use case, as this structure can be used as a template for future projects
- y is our outcome column – for this example we are trying to predict the charges for each medical insurance patient
- The next step is to loop through cat_cols and set each column's type to category using the astype('category') method
- After these steps, we convert the data frame columns to a numpy array representation, using the stack method to stack the arrays on top of each other – a list comprehension achieves this
- Once we have the cats variable as an array, we can simply use torch.tensor to convert it into a tensor for processing with PyTorch
- We repeat the same steps for the continuous variables to make sure they are tensorfied too (is that even a word?)
- Finally, we make sure the outcome variable is also a tensor, cast to torch.float, as the outcome has multiple decimal places
Set our embeddings
Essentially, in this context, embeddings map the values of each categorical feature column to a learned continuous vector. They function slightly differently to the embeddings used in NLP models and neural nets, albeit they are still learned continuous vectors. To get the embeddings for any tabular model, you can use the following snippet:
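A sketch of that snippet, following the rule of thumb described below:

```python
# How many levels each categorical column has
cat_szs = [len(df[col].cat.categories) for col in cat_cols]

# Embedding size rule of thumb: half the number of levels (+1 for rounding),
# capped at a maximum of 50
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
print(emb_szs)
```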
Logically, the following occurs:
- We use a list comprehension to loop through each col in the categorical columns and take the length of each column's categories – essentially, how many levels each category has – this we call cat_szs
- Then, we use a second comprehension to get each embedding size: the number of levels plus one, integer-divided by 2, capped at a maximum of 50. This magic will work out the embeddings needed for any tabular regression problem. Be sure to note the output down, as you will need it for the inference step later on. Printed out, the embedding sizes for our current problem are [(2, 1), (2, 1), (4, 2), (6, 3)].
Great – we have our embeddings for our categorical columns, i.e. the number of levels each of the four columns has. To translate this:
- Sex has two values 1) male and 2) female
- Smoker has two values 1) yes and 2) no
- Region has four levels
- Children has six levels
The next step is our model definition. Once we have built this model backbone, we will save it in a separate Python file and store it in our folder structure, so we can import it into any project where the MLPRegressor is utilised.
Building our regression skeleton
I will include the whole code and then go through each step as we go along:
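The repository holds the canonical version; here is a sketch consistent with the walkthrough below (parameter names follow the post where given, with conventional names assumed elsewhere):

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        # One embedding matrix per categorical feature: (levels, embedding size)
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        # Random node dropout on the embeddings to prevent overfitting
        self.emb_drop = nn.Dropout(p)
        # Batch normalise the continuous variables to a standardised scale
        self.bn_cont = nn.BatchNorm1d(n_cont)

        # Build the layer structure from the list passed in
        layerlist = []
        n_emb = sum(nf for ni, nf in emb_szs)  # total embedding width
        n_in = n_emb + n_cont                  # embeddings + continuous inputs
        for i in layers:
            layerlist.append(nn.Linear(n_in, i))
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        # Final linear layer down to the output size
        layerlist.append(nn.Linear(layers[-1], out_sz))
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        embeddings = []
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        return self.layers(x)
```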
Define the modelling block
Let’s dive into this:
- The MLPRegressor subclasses nn.Module
- In the initialisation function __init__() we pass in the embedding sizes, the number of continuous variables, the output size, the layer structure (as a list) and the dropout probability (dropout, in essence, drops out nodes in the network at random and prevents overfitting)
- We then use super().__init__() to inherit (via class inheritance) all the parameters and structure of the parent class
- This is followed by defining self.embeds and using nn.ModuleList to loop through each embedding in the embedding sizes – remember, we have a list of four tuples, each containing two values – hence we are setting the embedding sizes for the network
- The dropout layer is then defined using self.emb_drop, set to nn.Dropout with the required probability of dropping nodes at random
- We then batch normalise the continuous variables to bring them down to a standardised scale, using nn.BatchNorm1d()
- After we have defined the neural network steps, we use an empty list to specify the layer depth of our network
- We then set n_emb to the sum of the number of features in our embedding sizes
- This is followed by the variable n_in, which adds the embedding width to the number of continuous features to get the full input size
Let’s break at this point and grab a coffee! You have earned it!
Once we know the input structure, we can add to the layers by looping through each of the requested layer sizes and building a substructure underneath – this is what the code below does:
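From the class sketch above, the loop in question looks like this:

```python
for i in layers:
    layerlist.append(nn.Linear(n_in, i))     # linear layer matched to the inputs
    layerlist.append(nn.ReLU(inplace=True))  # activation function
    layerlist.append(nn.BatchNorm1d(i))      # normalise each batch
    layerlist.append(nn.Dropout(p))          # random node dropout
    n_in = i                                 # next layer takes this layer's size
layerlist.append(nn.Linear(layers[-1], out_sz))  # final layer sized by the output
```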
This will add:
- an nn.Linear layer as the first input, which takes the current input size and matches it to the layer size – as this is a regression model, this relates back to the good old regression algorithms Galton first described
- nn.ReLU as the activation function to use
- nn.BatchNorm1d as the way of normalising each batch
- nn.Dropout as the random node dropout probability
- finally, one last nn.Linear layer sized by the output
The last step of the model definition is to set the self.layers variable to a sequential model (nn.Sequential), as we want to process each of the steps in sequence. Next, we define the forward propagation method for our model.
Define the forward passing setup
We project the inputs forward through the network, performing the following steps (the sketch after this list puts them together):
- Initialise an empty embeddings list
- Loop through the self.embeds variable and append each embedded categorical column to the empty embeddings list
- We then concatenate the embeddings and pass them through a dropout layer
- For the continuous variables, we apply a one-dimensional batch normalisation pass
- We then concatenate the encoded categorical values with the continuous values: x = torch.cat([x, x_cont], 1)
- Finally, we pass x through self.layers and return the result
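Putting those steps together, the forward method from the class sketch above:

```python
def forward(self, x_cat, x_cont):
    embeddings = []
    for i, e in enumerate(self.embeds):
        embeddings.append(e(x_cat[:, i]))  # embed each categorical column
    x = torch.cat(embeddings, 1)           # concatenate the embeddings
    x = self.emb_drop(x)                   # dropout layer
    x_cont = self.bn_cont(x_cont)          # batch normalise continuous inputs
    x = torch.cat([x, x_cont], 1)          # join categorical and continuous
    return self.layers(x)                  # pass through the sequential layers
```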
In a nutshell, what we have done is built a way to handle continuous and categorical variables in our network with embeddings. We have defined the model structure, made it sequential, and indicated how the forward pass flows through the network.
Saving our MLPRegressor structure
Now that we have this structure built, we will save it in a separate Python file called Regression.py, nested inside a models directory. Your folder structure should look like this:
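As a sketch (the __init__.py is an assumption that makes the models folder importable as a package):

```
├── models/
│   ├── __init__.py
│   └── Regression.py
├── mlp_train.py
└── mlp_infer.py
```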
Once we have this, we can import it into our two main Python projects using:
from models.Regression import MLPRegressor
Use the model
The following steps will load the model in and we will pass inputs to the model to get it setup:
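A sketch of these steps (the seed value, layer sizes and dropout probability are illustrative assumptions, not necessarily those used in the repository):

```python
# Random seed for repeatable results
torch.manual_seed(123)

# Instantiate the model: embedding sizes, number of continuous columns,
# output size of 1 (regression), layer structure and dropout probability
model = MLPRegressor(emb_szs, conts.shape[1], out_sz=1,
                     layers=[200, 100], p=0.4)
print(model)
```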
The steps here are:
- initialise a random seed value to make the results repeatable
- create a model variable and pass to our MLPRegressor model the embedding sizes, the shape of the continuous variables tensor, our output size (because it is a regression problem, we output one value only), the layer structure (passed as a list of values) and a dropout probability (if you don't specify this, a default value is used, as it is an optional parameter)
- printing the model will show the layers in the network
Splitting our data
Before we move on to setting up the training loop we are going to split the data into train and test sets (you could add a validation split here as well):
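A simple way to do this is index slicing; this sketch holds back 15% of the batch as a test set (the proportion and method are assumptions; the repository script is the canonical version):

```python
# Hold back 15% of the batch as a test set
test_size = int(batch_size * 0.15)

cat_train = cats[:batch_size - test_size]
cat_test = cats[batch_size - test_size:batch_size]
con_train = conts[:batch_size - test_size]
con_test = conts[batch_size - test_size:batch_size]
y_train = y[:batch_size - test_size]
y_test = y[batch_size - test_size:batch_size]
```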
Creating our training loop
This is the workhorse and something you will see in every PyTorch implementation. This indicates how to train the network and perform weight updates and optimisation steps. I will take you through this step-by-step after the Git Gist below:
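Here is a sketch reconstructing that loop from the description below (the print formats, default parameter values and artifact file name are assumptions):

```python
def train(model, y_train, categorical_train, continuous_train,
          y_test, categorical_valid, continuous_valid,
          learning_rate=0.001, epochs=300, print_out_interval=25):
    global criterion
    # MSE loss for a regression problem
    criterion = nn.MSELoss()
    # Adam optimiser over the model parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    start_time = time.time()
    model.train()  # put the model into training mode

    losses = []
    preds = []
    for i in range(1, epochs + 1):
        y_pred = model(categorical_train, continuous_train)
        preds.append(y_pred)
        # RMSE: square root of the MSE between prediction and outcome
        loss = torch.sqrt(criterion(y_pred, y_train))
        losses.append(loss)
        if i % print_out_interval == 0:
            print(f'epoch: {i:3} loss: {loss.item():10.5f}')
        optimizer.zero_grad()  # clear the gradient tape
        loss.backward()        # propagate the loss backwards
        optimizer.step()       # update the weights

    print(f'epoch: {i:3} final loss: {loss.item():10.5f}')
    print(f'Duration: {time.time() - start_time:.0f} seconds')

    # Evaluate on the validation data with gradients disabled
    with torch.no_grad():
        y_val = model(categorical_valid, continuous_valid)
        val_loss = torch.sqrt(criterion(y_val, y_test))
    print(f'RMSE: {val_loss.item():.8f}')

    # Collect predictions, differences and actuals per validation record
    val_preds, diffs, actuals = [], [], []
    for i in range(len(y_val)):
        diffs.append(np.abs(y_val[i].item() - y_test[i].item()))
        val_preds.append(y_val[i].item())
        actuals.append(y_test[i].item())

    valid_results_dict = {'predictions': val_preds,
                          'diffs': diffs,
                          'actuals': actuals}

    # Persist the trained weights (the model_artifacts folder must exist)
    torch.save(model.state_dict(), f'model_artifacts/{data_name}.pt')
    return losses, preds, diffs, actuals, model, valid_results_dict, epochs
```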
Let’s step into this:
- First, we create our train function and define the parameters needed. The optional parameters are learning_rate, epochs and print_out_interval
- We initialise criterion as a global variable, as we will need to use it later on
- The criterion is set to mean squared error loss, as this is a regression problem and we want to track the average error across the regression line
- For the gradient descent steps we use the Adam optimiser, passing in our model parameters and the rate we want the model to learn at
- We then start a timer and call model.train() to put the model into training mode
- Then, we initialise empty lists to store our losses and preds (predictions) at each epoch
- From here we iterate through the epochs and perform the following steps:
  - create a y_pred variable by passing the categorical and continuous training tensors to the model
  - append the prediction to the empty list
  - create a torch tensor that takes the square root of the mean squared error between the prediction and the training outcome, i.e. the RMSE
  - append the loss to the losses list
  - use the print_out_interval variable with the modulo operator to print the loss at the current epoch
  - clear the gradient tape with the special torch function optimizer.zero_grad()
  - propagate the loss backwards and take an optimizer step to pick better weights
  - finally, print the loss per epoch and the duration of training
- Once the training step is done, we evaluate on the validation data as we go. The steps taken here are:
  - first, we disable gradient calculation while we pass inference examples to our model, using torch.no_grad()
  - then we initialise a y_val variable, passing the categorical and continuous validation tensors to the model
  - we take the square root of the loss again
  - then print the RMSE (Root Mean Squared Error)
  - create empty lists for the predictions, differences and actuals
  - then loop through the length of the validation tensor
  - use numpy to take the absolute value (abs) of the difference between each validation prediction and the actual
  - get the pred from the validation prediction and the actual from the y_test tensor
  - append the diffs, preds and actuals to the respective empty lists
  - out of the loop, create a dictionary to store the predictions, differences and actuals
- Finally, we save the model with model.state_dict() to our model_artifacts folder and return the losses, preds, diffs, actuals, model, valid_results_dict and epochs to be used later on in the training script.
Using our training loop
Using multiple assignment we will store each one of the outputs of our train function:
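A sketch of the call (the learning rate is an assumption; 400 epochs matches the loss chart discussed later):

```python
losses, preds, diffs, actuals, model, valid_results_dict, epochs = train(
    model, y_train, cat_train, con_train,
    y_test, cat_test, con_test,
    learning_rate=0.01, epochs=400, print_out_interval=25)
```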
Here we pass in the model, y_train (the training outcome variable), the categorical and continuous training tensors, the same again for our validation data, then select a learning rate, the number of epochs and the print_out_interval. When this is triggered, the script will print the loss at each print_out_interval, followed by the validation RMSE.
View our predictions vs actuals
Next, we will visualise where our model is performant and where it is way off the mark:
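A sketch, assuming the dictionary keys from the training-loop sketch above:

```python
# Store the validation results and plot predictions against actuals
evaluation = pd.DataFrame(valid_results_dict)
sns.scatterplot(data=evaluation, x='actuals', y='predictions')
plt.title('Predictions vs actuals')
plt.show()
```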
This stores our results in a pandas DataFrame and then visualises them in a scatter chart.
This is a difficult dataset: there are many outliers, and there appear to be bands of people with different medical insurance values, likely indicative of the type of procedure the insurance was needed for and the level of cover required. You could treat the outliers and repeat the training – I will leave this to you to perfect, as the aim of this post is to show how to use PyTorch to create a regression model.
Produce our model training graph
We will now see how well the model training performed:
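A sketch (the csv file name is an assumption):

```python
# Collapse the loss tensors to plain floats with .item()
losses_collapsed = [loss.item() for loss in losses]
epochs_list = [ep + 1 for ep in range(epochs)]
loss_df = pd.DataFrame({'epoch': epochs_list, 'loss': losses_collapsed})

# Save to csv and draw the seaborn chart
loss_df.to_csv(f'{data_name}_losses.csv', index=False)
sns.lineplot(data=loss_df, x='epoch', y='loss')
plt.title('Model loss by epoch')
plt.show()
```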
This step uses a list comprehension to iterate through the losses list we created in our training loop, extracting each .item() (i.e. each loss value) into a new list called losses_collapsed. We do a similar comprehension to get the number of epochs and then create a pandas DataFrame. We then save the data to csv and create the seaborn line chart of loss by epoch.
This shows our model is still learning after 400 epochs, as the loss is still declining; we could extend the number of epochs to reduce the loss further.
We now have our training script in place; the full code is captured in the training script linked at the top of this post.
Building our model inference script
Now we are going to use our trained model to infer from our production data, passing through multiple examples from our production medical insurance dataset. In real life, this dataset would contain new records for which we want to estimate the medical insurance cost; we would not know the actual cost, as we are making new predictions.
Feature engineering
We will repeat the same steps as the previous example, with one slight change of loading in the production medical insurance dataset:
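As a sketch (the production file name here is a placeholder, not the real file from the repository):

```python
# Placeholder path - substitute your own production dataset here
df = pd.read_csv('insurance_production.csv').dropna()

# Same column lists and encoding steps as the training script
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
for col in cat_cols:
    df[col] = df[col].astype('category')

cats = torch.tensor(
    np.stack([df[col].cat.codes.values for col in cat_cols], axis=1),
    dtype=torch.int64)
conts = torch.tensor(
    np.stack([df[col].values for col in cont_cols], axis=1),
    dtype=torch.float)
```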
This next step is important. We need to specify a list of tuples with the same embedding sizes as in the training script; if we tried to derive this dynamically in the inference script, the production data might not contain every level of each categorical column and there would be a shape mismatch. I have hard coded this, but you could import it from a text file, JSON file or similar. Again, these embeddings must match our training embedding sizes for the network to work correctly:
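Given the level counts we saw earlier, the hard-coded list looks like this:

```python
# Hard-coded to match the training script's embedding sizes:
# sex (2 levels), smoker (2), region (4), children (6)
emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]
```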
Right, we have everything in place: our categorical and continuous values are encoded and converted to PyTorch tensors, and our embeddings have been carried over from the training script to match the shapes. Please note: for your own dataset, this would need to be updated to match the shape of your categorical and continuous values.
Load and use our saved model
In the next steps we will load our saved model with the same parameters we used for training, and then load the state_dict() from the model_artifacts folder (this folder could be named anything you like; I just called it model artifacts):
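A sketch, reusing the same (assumed) layer sizes and dropout from the training sketch, and the insurance.pt file name it saved:

```python
# Recreate the model with the same parameters used for training
model_infer = MLPRegressor(emb_szs, conts.shape[1], out_sz=1,
                           layers=[200, 100], p=0.4)

# Load the saved weights and put the model in evaluation mode
model_infer.load_state_dict(torch.load('model_artifacts/insurance.pt'))
print(model_infer.eval())
```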
The print of model_infer.eval() will show the original saved model structure.
Here you can see the importance of our embeddings matching, otherwise the model will throw a wobbly!
Define function to process our prod data
I will explain this function in more detail underneath the code:
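A sketch of such a function (the name predict is my label; the post doesn't name it):

```python
def predict(model, cats, conts, verbose=True):
    # Disable gradient calculation for inference
    with torch.no_grad():
        y_val = model(cats, conts)

    preds = []
    # Loop through every production record passed through the model
    for i in range(len(y_val)):
        pred = y_val[i].item()  # get each .item() from the tensor
        preds.append(pred)
        if verbose:
            print(f'Prediction {i + 1}: {pred:.2f}')
    return preds
```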
Let’s break this down:
- model takes in the loaded model from the state_dict() we loaded
- setting torch.no_grad() disables gradient calculation to allow for inference
- we set y_val to the output of the model, passing in our categorical and continuous PyTorch tensors
- we then create an empty preds list to store our results
- this is followed by a loop through all the production records in the dataset, where we:
  - get the length of the production items passed through the model
  - get each .item() from the torch tensor
  - append each prediction to our preds list, incrementally, until the loop reaches the end
  - use a boolean variable called verbose to indicate whether to print each result
- after all this, the only return from the function is the preds list
Running the function prints a prediction for each record in the production set.
Some of these predictions are dubious, as the outliers and differences in medical insurance costs are clearly confusing the model. To make this model more useful, I would suggest creating a piecewise regression model to deal with the different bands, along with treating the outliers.
The full code for the inference script is in the repository linked at the top of this post.
We have reached the end!
Wow – congratulations on getting this far. We have covered so much content in this tutorial.
Feel free to adapt the code and create a pull request to the GitHub repository if you want to add to or adapt the code in any way. Remember – the aim of this tutorial was to show how you can create a regression network in PyTorch, not to go in depth on the countless ways you can encode features before modelling – that would be its own tutorial.
Learning PyTorch can be harder than TensorFlow, as it is very pythonic and requires you to build classes; however, once you get used to it, the tool is very powerful, and it is the framework I mostly use in my natural language processing work at my company.
You have done well and keep on coding!