
Build a PyTorch regression MLP from scratch




In this post we will go through how to build a feed-forward neural network from scratch, with the awesome PyTorch library. PyTorch was developed by researchers at Meta (Facebook), originally to make processing natural language easier. Here I will show you how to extend it to some of your more common tabular data tasks.

Where can I get the scripts?

The scripts are available:

Building the training script

The below steps will take you through how to build the training script end-to-end.

Let’s get going!!!

What data are we using?

The data we are going to use is medical insurance data, which was posted as a challenge on the machine learning competition platform Kaggle. The platform pits ML engineers and practitioners against each other in a contest to find the most accurate model. As this was a regression challenge, the metric to optimise was the root mean squared error (RMSE).

To source the data for this tutorial directly, go to my GitHub data resource (https://github.com/StatsGary/Data/blob/main/insurance.csv); the same reference is linked in the code we are about to create.

Within the data we see 7 columns and 1338 rows, with the fields containing:

  • Age of patient seeking medical insurance
  • Sex of patient (categorical)
  • Body Mass Index (BMI) – continuous variable
  • Number of children patient has
  • Smoker – categorical variable
  • Region of patient – categorical variable
  • Charges (continuous variable) and our outcome of interest

Using the charges column as our outcome, we are going to try to estimate the charge for each patient requiring insurance.

One thing to note

Here we will focus on building the model architecture and components. One thing we won’t be doing is extensive feature engineering, outlier analysis or feature selection, as each of these would warrant an article of its own.

Importing the packages we will need

The following are the list of imports we are going to need:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import seaborn as sns
from datetime import datetime as dt
# Custom imports
from models.Regression import MLPRegressor

Under custom imports we will build our own MLPRegressor model and store this in a models folder to be used across multiple projects. I will detail the folder structure needed once we are at that juncture.

Data loading and initial setup

We will load the data in and perform a few steps, as well as setting our batch size, which we will use later as a parameter in our model training step.

data_name ='medical_insurance'
# Read in the medical insurance data
df = pd.read_csv('https://raw.githubusercontent.com/StatsGary/Data/main/insurance.csv')
# Drop nulls
df.dropna(axis='columns',inplace=True)
# Get number of rows
obs = len(df)
# Divide obs in half to get half batch size
batch_size = obs // 2

Here we are:

  • setting a data_name variable to capture the current project – we will use this later on when we are saving our artifacts and models
  • creating a variable called df to store the insurance file read from raw GitHub user content
  • dropping any null values – there are many more ways you could treat this data, one of which is MICE, but as I said we will be focussing on building the model out, not the hundred other steps involved in getting the data into better shape
  • getting the number of rows in the dataset
  • creating a batch size that is half the number of observations – for better results in the model training you could tune this as a hyperparameter

The next step is some initial feature engineering to treat the non-continuous columns, i.e. those with multiple levels and categorical descriptions.

Feature engineering and creating tensors

After the code snippet I will take you through each of the lines of code in a stepwise fashion:

#=====================================================================================
# Feature Engineering
#=====================================================================================
# Encode the categorical features
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
# Set the target (y) column
y = ['charges']
# CONVERT CATEGORICAL COLUMNS
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1)
cats = torch.tensor(cats, dtype=torch.int64)
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
y = torch.tensor(df[y].values, dtype=torch.float).reshape(-1,1)

Strap yourselves in and here we go:

  • cat_cols is a list of the categorical column names: sex, smoker, region and children
  • cont_cols is a list of the continuous column names: age and bmi
  • At this point it goes without saying that you would need to adapt these list values for your own use case, as this structure can be used as a template for future projects
  • y is set to our outcome column – for this example we are trying to predict the charges for each medical insurance record
  • The next step is to loop through cat_cols and set each column’s type to category using the astype('category') method
  • After we have done this we need to convert the data frame columns to a numpy array representation, using np.stack to stack the encoded code arrays side by side – here we use a list comprehension to achieve this result
  • Once we have the cats variable as an array, we can simply use torch.tensor to convert it into a tensor for processing with PyTorch
  • We repeat the same steps for the continuous variables to make sure we have them tensorfied (is that even a word?)
  • Finally, we make sure the outcome variable is also a tensor and cast it to torch.float, as the outcome will have multiple decimal places after it
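
If you want to see what the category encoding actually did, you can peek at one column (a quick aside of my own; pandas assigns the integer codes from the sorted category levels by default):

# For the sex column this should give codes 0/1 for female/male
print(df['sex'].cat.categories)
print(df['sex'].cat.codes.head())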

Set our embeddings

An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.

Essentially, in this context they are the values from each of the feature columns mapped as embeddings in the tensor, and they function slightly differently to embeddings used in NLP models, albeit they are still a learned continuous vector. To get the embedding sizes for any tabular model, you can use the following snippet:

cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
print(emb_szs)

To be logical, the following occurs:

  • We use a list comprehension to loop through each col in the categorical columns and take the length of each column’s categories – essentially, how many levels each category has – and call the result cat_szs
  • Then, we use a second comprehension to set each embedding size to the smaller of 50 and (size + 1) // 2 (integer division). This magic will work out the embeddings needed for any tabular regression problem. Be sure to note the output down, as you will need it for the inference step later on. Printed out, the embedding sizes for our current problem are shown below:
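
[(2, 1), (2, 1), (4, 2), (6, 3)]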

Great – we have our embeddings for our categorical columns, i.e. the number of levels in each of the 4 columns and the vector size each is mapped down to. To translate this:

  • Sex has two values 1) male and 2) female
  • Smoker has two values 1) yes and 2) no
  • Region has four levels
  • Children has six levels

The next step is our model definition. Once we have built this model backbone, we will save it in a separate Python file and store it in our folder structure, so we can import it into any project where the MLPRegressor is utilised.

Building our regression skeleton

I will include the whole code and then go through each step as we go along:

import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        # One embedding layer per categorical feature
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        # Batch normalisation for the continuous features
        self.bn_cont = nn.BatchNorm1d(n_cont)

        layerlist = []
        n_emb = sum(nf for ni, nf in emb_szs)
        n_in = n_emb + n_cont
        # Build each hidden layer: Linear -> ReLU -> BatchNorm -> Dropout
        for i in layers:
            layerlist.append(nn.Linear(n_in, i))
            layerlist.append(nn.ReLU(inplace=True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(p))
            n_in = i
        # Final linear layer maps the last hidden layer to the output size
        layerlist.append(nn.Linear(layers[-1], out_sz))
        self.layers = nn.Sequential(*layerlist)

    def forward(self, x_cat, x_cont):
        # Look up each categorical column in its embedding layer
        embeddings = []
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)
        # Normalise the continuous features, then join everything together
        x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        x = self.layers(x)
        return x

Define the modelling block

Let’s dive into this:

  • The MLPRegressor subclasses nn.Module
  • In the initialisation function __init__() we pass in the embedding sizes, the number of continuous variables, the output size, the layer structure (as a list) and the dropout probability (dropout, in essence, zeroes out nodes in the network at random and helps prevent overfitting)
  • We then use super().__init__() to inherit (via class inheritance) all the parameters and structure of the parent class
  • This is followed by defining self.embeds, using nn.ModuleList to create one nn.Embedding layer per entry in the embedding sizes – remember we have a list of four tuples, each containing two values, so we are setting the embedding sizes for the network
  • The dropout layer is then defined by setting self.emb_drop to nn.Dropout with the required probability
  • We then batch normalise the continuous variables to bring them onto a standardised scale, using nn.BatchNorm1d()
  • After we have defined those steps, we use an empty list to build up the layer depth of our network
  • We then set n_emb to the sum of the embedding output sizes (the nf values in emb_szs)
  • This is followed by the variable n_in, which adds the number of embedding outputs to the number of continuous features to get the full input width

Let’s break at this point and grab a coffee! You have earned it!
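
To make those sums concrete for our dataset, here is a quick worked example (my own, using the embedding sizes printed earlier):

emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]
n_emb = sum(nf for ni, nf in emb_szs)  # 1 + 1 + 2 + 3 = 7 embedding outputs
n_cont = 2                             # age and bmi
n_in = n_emb + n_cont                  # 9 inputs feed the first nn.Linear layer
print(n_emb, n_in)                     # 7 9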

Once we know the input structure we can now add to the layers by looping through each of the layers and building a sub structure underneath – this is what the code below does:

for i in layers:
    layerlist.append(nn.Linear(n_in, i))
    layerlist.append(nn.ReLU(inplace=True))
    layerlist.append(nn.BatchNorm1d(i))
    layerlist.append(nn.Dropout(p))
    n_in = i
layerlist.append(nn.Linear(layers[-1], out_sz))

This will add, for each hidden layer:

  • an nn.Linear layer as the input – this matches the number of inputs to the size of the layer and, as this is a regression model, relates back to the good old regression algorithms Galton first espoused, postulated and discovered
  • nn.ReLU as the activation function to use
  • nn.BatchNorm1d as the way of normalising each batch
  • nn.Dropout as the random node dropout probability
  • Finally, outside the loop, a last nn.Linear layer mapping the final hidden layer to the output size

The last step of the model definition is to set the self.layers variable to a sequential model (nn.Sequential), as we want to process each of the steps in sequence. Next we define the forward propagation method for our model.

Define the forward passing setup

We will project the inputs forward through the network, performing the following steps:

  • Initialise an empty embeddings list
  • Loop through the self.embeds module list and append each categorical column’s embedded values to the embeddings list
  • We then concatenate the embeddings and pass them through the dropout layer
  • For the continuous variables, we first apply a 1-dimensional batch normalisation pass
  • We then concatenate the encoded categorical values with the continuous values: x = torch.cat([x, x_cont], 1)
  • Finally, we pass x through self.layers and return x

In a nutshell – what we have done is build a way to handle continuous and categorical variables in our network, with embeddings. We have defined the model structure, made it sequential, and indicated how the forward pass flows through the network.
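
As a quick sanity check that the pieces fit together, here is a minimal sketch (my own, not part of the article’s scripts) that pushes a dummy batch through the model and confirms the output shape:

import torch
from models.Regression import MLPRegressor

# The embedding sizes we computed earlier for this dataset
emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]
model = MLPRegressor(emb_szs, n_cont=2, out_sz=1, layers=[200, 100], p=0.4)

x_cat = torch.randint(0, 2, (8, 4))  # 8 dummy rows, 4 categorical columns
x_cont = torch.randn(8, 2)           # 8 dummy rows, 2 continuous columns (age, bmi)
print(model(x_cat, x_cont).shape)    # torch.Size([8, 1]) - one prediction per row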

Saving our MLPRegressor structure

Now we have this structure built we will save this in a separate Python file called Regression.py and this will be nested inside a models directory. Your folder structure should look like this:
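
The exact layout is flexible, but based on the paths used throughout the scripts, something along these lines works (the script names train.py and inference.py are my own placeholders):

.
├── models/
│   └── Regression.py      # the MLPRegressor class above
├── model_artifacts/       # saved .pt state_dicts land here
├── charts/                # saved plots land here
├── data/                  # saved CSVs land here
├── train.py               # the training script we are building
└── inference.py           # the inference script (covered later)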

Once we have this we can import it into our two main Python projects using:

from models.Regression import MLPRegressor

Use the model

The following steps will load the model and pass inputs to it to get it set up:

# Use the model
torch.manual_seed(123)
model = MLPRegressor(emb_szs, conts.shape[1], out_sz=1, layers=[200,100], p=0.4)
print('[INFO] Model definition')
print(model)
print('='* 80)

The steps here are:

  • initialise a random seed value to make the results repeatable
  • create a model variable, passing our MLPRegressor the embedding sizes, the number of columns in the continuous tensor, the output size (because it is a regression problem we output 1 value only), the layer structure (passed as a list of values) and a dropout probability (if you don’t specify this, the default value is used, as it is an optional parameter)
  • printing the model structure will show the layers in the network
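
For reference, printing the model should show a structure along these lines (a sketch based on the class definition; exact repr formatting varies between PyTorch versions):

MLPRegressor(
  (embeds): ModuleList(
    (0): Embedding(2, 1)
    (1): Embedding(2, 1)
    (2): Embedding(4, 2)
    (3): Embedding(6, 3)
  )
  (emb_drop): Dropout(p=0.4, inplace=False)
  (bn_cont): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=9, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=1, bias=True)
  )
)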

Splitting our data

Before we move on to setting up the training loop we are going to split the data into train and test sets (you could add a validation split here as well). Note that the slices below stop at batch_size, so only the first half of the rows is actually used; see the shape check after the snippet:

#=====================================================================================
# Split the data
#=====================================================================================
test_size = int(batch_size * .2)
cat_train = cats[:batch_size-test_size]
cat_test = cats[batch_size-test_size:batch_size]
con_train = conts[:batch_size-test_size]
con_test = conts[batch_size-test_size:batch_size]
y_train = y[:batch_size-test_size]
y_test = y[batch_size-test_size:batch_size]
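
As a quick check on what that leaves us with, here are some sanity-check lines of my own (using the 1,338 rows mentioned earlier, so batch_size = 669 and test_size = 133):

print(cat_train.shape, cat_test.shape)  # torch.Size([536, 4]) torch.Size([133, 4])
print(con_train.shape, con_test.shape)  # torch.Size([536, 2]) torch.Size([133, 2])
print(y_train.shape, y_test.shape)      # torch.Size([536, 1]) torch.Size([133, 1])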

Creating our training loop

This is the workhorse and something you will see in every PyTorch implementation: it defines how to train the network and perform the weight updates and optimisation steps. I will take you through it step-by-step after the code below:

#=====================================================================================
# Train the model
#=====================================================================================
def train(model, y_train, categorical_train, continuous_train,
          y_val, categorical_valid, continuous_valid,
          learning_rate=0.001, epochs=300, print_out_interval=2):
    global criterion
    criterion = nn.MSELoss()  # we'll convert this to RMSE later
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    start_time = time.time()
    model.train()
    losses = []
    preds = []
    for i in range(epochs):
        i += 1  # Zero indexing trick to start the print out at epoch 1
        y_pred = model(categorical_train, continuous_train)
        preds.append(y_pred)
        loss = torch.sqrt(criterion(y_pred, y_train))  # RMSE
        losses.append(loss)
        if i % print_out_interval == 1:
            print(f'epoch: {i:3} loss: {loss.item():10.8f}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('='*80)
    print(f'epoch: {i:3} loss: {loss.item():10.8f}')  # print the last line
    print(f'Duration: {time.time() - start_time:.0f} seconds')  # print the time elapsed
    # Evaluate model on the validation tensors
    with torch.no_grad():
        y_pred_val = model(categorical_valid, continuous_valid)
        loss = torch.sqrt(criterion(y_pred_val, y_val))
    print(f'RMSE: {loss:.8f}')
    # Create empty lists to store the results
    preds = []
    diffs = []
    actuals = []
    for i in range(len(categorical_valid)):
        diff = np.abs(y_pred_val[i].item() - y_val[i].item())
        pred = y_pred_val[i].item()
        actual = y_val[i].item()
        diffs.append(diff)
        preds.append(pred)
        actuals.append(actual)
    valid_results_dict = {
        'predictions': preds,
        'diffs': diffs,
        'actuals': actuals
    }
    # Save model
    torch.save(model.state_dict(), f'model_artifacts/{data_name}_{epochs}.pt')
    # Return components to use later
    return losses, preds, diffs, actuals, model, valid_results_dict, epochs

Let’s step into this:

  • First we create our train function and define the parameters needed. The optional parameters are learning_rate, epochs and print_out_interval
  • We initialise criterion as a global variable, as we will need to use it later on
  • The criterion is set to mean squared error loss, as this is a regression problem and we want to track the average squared error across the regression line
  • For the gradient descent steps we use the Adam optimizer, passing in our model parameters and the rate we want the model to learn at
  • We then start a timer and call model.train() (to put the model into training mode)
  • Then, we initialise empty lists to store our losses and preds (predictions) at each epoch
  • From here we iterate through each epoch and perform the following steps:
    • create a y_pred variable by passing the categorical and continuous training tensors to the model
    • append the predictions to the preds list
    • take the square root of the MSE between the predictions and the training targets, giving the RMSE loss
    • append the loss to the losses list
    • use the print_out_interval variable, via the modulo operator, to print the loss at the chosen epoch interval
    • clear the gradient tape with the special torch function optimizer.zero_grad()
    • propagate the loss backwards and take an optimizer step to update the weights
    • finally, print the loss for the last epoch and the total duration of training
  • Once the training step is done, we evaluate on the validation data as we go. The steps taken here are:
    • first we disable gradient calculation while we pass examples to our model, using torch.no_grad()
    • then we pass the categorical validation tensor and the continuous validation tensor to the model to get the validation predictions
    • we take the square root of the MSE once more, this time against the validation targets
    • then print the RMSE (root mean squared error)
    • create empty lists for the predictions, differences and actuals
    • then we loop through the length of the validation tensor
    • use numpy’s abs to take the absolute difference between each validation prediction and its actual value
    • get the pred from the validation predictions and the actual from the validation targets
    • we then append the diffs, preds and actuals to their respective lists
    • out of the loop, we create a dictionary to store the predictions, differences and actuals
  • finally, we save the model via model.state_dict() to our model_artifacts folder and return the losses, preds, diffs, actuals, model, valid_results_dict and epochs to be used later in the training script.

Using our training loop

Using multiple assignment we will store each one of the outputs of our train function:

losses, preds, diffs, actuals, model, valid_results_dict, epochs = train(
    model=model,
    y_train=y_train,
    categorical_train=cat_train,
    continuous_train=con_train,
    y_val=y_test,
    categorical_valid=cat_test,
    continuous_valid=con_test,
    learning_rate=0.01,
    epochs=400,
    print_out_interval=25)

Here we pass in the model, y_train (the training outcome variable), the categorical and continuous training tensors, the equivalent validation tensors, a learning rate, the number of epochs and the print_out_interval. When this is triggered, the script prints the loss every 25 epochs, followed by the final loss, the training duration and the validation RMSE.

View our predictions vs actuals

Next, we will visualise where our model is performant and where it is way off the mark:

valid_res = pd.DataFrame(valid_results_dict)
# Visualise results
current_time = dt.now().strftime('%Y-%m-%d_%H-%M-%S')
plt.figure()
sns.scatterplot(data=valid_res, x='predictions', y='actuals',
                size='diffs', hue='diffs')
plt.savefig(f'charts/{data_name}valid_results_{current_time}.png')

This stores our results in a pandas DataFrame and then visualises them in a scatter chart:

This is a difficult dataset: there are many outliers, and there appear to be bands of patients with very different medical insurance charges, presumably reflecting the type of procedure the insurance was needed for and the level of cover required. You could treat the outliers and repeat the training – I will leave that to you to perfect, as the aim of this post is to show how to use PyTorch to create a regression model.

Produce our model training graph

We will now see how well the model training performed:

# Produce validation graph
losses_collapsed = [losses[i].item() for i in range(epochs)]
epochs = [ep+1 for ep in range(epochs)]
eval_df = pd.DataFrame({
    'epochs': epochs,
    'loss': losses_collapsed
})
# Save data to csv
eval_df.to_csv(f'data/{data_name}_valid_data_{current_time}.csv', index=None)
# Create SNS chart
plt.figure()
palette = sns.color_palette("mako_r", 6)
sns.lineplot(data=eval_df, x='epochs', y='loss', palette=palette)
plt.savefig(f'charts/{data_name}_loss_chart_{current_time}.png')

This step uses a list comprehension to iterate through the losses list we created in our training loop, extracting each loss value with .item(), and calls the new list losses_collapsed. We use a similar comprehension to get the epoch numbers and then create a pandas DataFrame.

We then save the data to csv and create the SNS chart. The chart looks as below:

This shows our model is still learning after 400 epochs, as the loss is still declining; we could extend the number of epochs to push the loss down further.

We now have our training script in place, the full code is captured here:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import seaborn as sns
from datetime import datetime as dt
# Custom imports
from models.Regression import MLPRegressor

#=====================================================================================
# Data Loading
#=====================================================================================
data_name = 'medical_insurance'
# Read in the medical insurance data
df = pd.read_csv('https://raw.githubusercontent.com/StatsGary/Data/main/insurance.csv')
# Drop nulls
df.dropna(axis='columns', inplace=True)
# Get number of rows
obs = len(df)
# Divide obs in half to get half batch size
batch_size = obs // 2

#=====================================================================================
# Feature Engineering
#=====================================================================================
# Encode the categorical features
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
# Set the target (y) column
y = ['charges']
# CONVERT CATEGORICAL COLUMNS
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1)
cats = torch.tensor(cats, dtype=torch.int64)
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
y = torch.tensor(df[y].values, dtype=torch.float).reshape(-1, 1)
# Set embedding sizes
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
print(emb_szs)
print(conts.shape[1])

# Use the model
torch.manual_seed(123)
model = MLPRegressor(emb_szs, conts.shape[1], out_sz=1, layers=[200,100], p=0.4)
print('[INFO] Model definition')
print(model)
print('='* 80)

#=====================================================================================
# Split the data
#=====================================================================================
test_size = int(batch_size * .2)
cat_train = cats[:batch_size-test_size]
cat_test = cats[batch_size-test_size:batch_size]
con_train = conts[:batch_size-test_size]
con_test = conts[batch_size-test_size:batch_size]
y_train = y[:batch_size-test_size]
y_test = y[batch_size-test_size:batch_size]

#=====================================================================================
# Train the model
#=====================================================================================
def train(model, y_train, categorical_train, continuous_train,
          y_val, categorical_valid, continuous_valid,
          learning_rate=0.001, epochs=300, print_out_interval=2):
    global criterion
    criterion = nn.MSELoss()  # we'll convert this to RMSE later
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    start_time = time.time()
    model.train()
    losses = []
    preds = []
    for i in range(epochs):
        i += 1  # Zero indexing trick to start the print out at epoch 1
        y_pred = model(categorical_train, continuous_train)
        preds.append(y_pred)
        loss = torch.sqrt(criterion(y_pred, y_train))  # RMSE
        losses.append(loss)
        if i % print_out_interval == 1:
            print(f'epoch: {i:3} loss: {loss.item():10.8f}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('='*80)
    print(f'epoch: {i:3} loss: {loss.item():10.8f}')  # print the last line
    print(f'Duration: {time.time() - start_time:.0f} seconds')  # print the time elapsed
    # Evaluate model on the validation tensors
    with torch.no_grad():
        y_pred_val = model(categorical_valid, continuous_valid)
        loss = torch.sqrt(criterion(y_pred_val, y_val))
    print(f'RMSE: {loss:.8f}')
    # Create empty lists to store the results
    preds = []
    diffs = []
    actuals = []
    for i in range(len(categorical_valid)):
        diff = np.abs(y_pred_val[i].item() - y_val[i].item())
        pred = y_pred_val[i].item()
        actual = y_val[i].item()
        diffs.append(diff)
        preds.append(pred)
        actuals.append(actual)
    valid_results_dict = {
        'predictions': preds,
        'diffs': diffs,
        'actuals': actuals
    }
    # Save model
    torch.save(model.state_dict(), f'model_artifacts/{data_name}_{epochs}.pt')
    # Return components to use later
    return losses, preds, diffs, actuals, model, valid_results_dict, epochs

# Use the training function to train the model
losses, preds, diffs, actuals, model, valid_results_dict, epochs = train(
    model=model,
    y_train=y_train,
    categorical_train=cat_train,
    continuous_train=con_train,
    y_val=y_test,
    categorical_valid=cat_test,
    continuous_valid=con_test,
    learning_rate=0.01,
    epochs=400,
    print_out_interval=25)

#=====================================================================================
# Validate the model
#=====================================================================================
valid_res = pd.DataFrame(valid_results_dict)
# Visualise results
current_time = dt.now().strftime('%Y-%m-%d_%H-%M-%S')
plt.figure()
sns.scatterplot(data=valid_res, x='predictions', y='actuals',
                size='diffs', hue='diffs')
plt.savefig(f'charts/{data_name}valid_results_{current_time}.png')
# Produce validation graph
losses_collapsed = [losses[i].item() for i in range(epochs)]
epochs = [ep + 1 for ep in range(epochs)]
eval_df = pd.DataFrame({
    'epochs': epochs,
    'loss': losses_collapsed
})
# Save data to csv
eval_df.to_csv(f'data/{data_name}_valid_data_{current_time}.csv', index=None)
# Create SNS chart
plt.figure()
palette = sns.color_palette("mako_r", 6)
sns.lineplot(data=eval_df, x='epochs', y='loss', palette=palette)
plt.savefig(f'charts/{data_name}_loss_chart_{current_time}.png')

Building our model inference script

Now we are going to use our trained model to infer from our production data. Here we will pass multiple examples from our production medical insurance dataset through the model. In real life this dataset would contain the new records we want to estimate medical insurance costs for; moreover, we would not know the actual cost, as we are trying to make new predictions.

Feature engineering

We will repeat the same steps as the previous example, with one slight change of loading in the production medical insurance dataset:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
#Custom model imports
from models.Regression import MLPRegressor
data_name ='medical_insurance'
# Read in the medical insurance data
df = pd.read_csv('https://raw.githubusercontent.com/StatsGary/Data/main/insurance_prod.csv')
# Drop nulls
df.dropna(axis='columns',inplace=True)
# Get number of rows
obs = len(df)
#=====================================================================================
# Feature Engineering
#=====================================================================================
# Encode the categorical features
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
# Set the target (y) column
y = ['charges']
# CONVERT CATEGORICAL COLUMNS
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1)
cats = torch.tensor(cats, dtype=torch.int64)
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
# Create outcome
y = torch.tensor(df[y].values, dtype=torch.float).reshape(-1,1)
# Set embedding sizes
cat_szs = [len(df[col].cat.categories) for col in cat_cols]

This next step is important. We need to specify a list of tuples with the same embedding sizes as in the training script; if we tried to compute this dynamically in the inference script, some categorical columns could have fewer levels in the production data and there would be a shape mismatch. I have hard-coded this, but you could import it from a text file, JSON file or similar. Again, the embedding sizes must match the training embedding sizes for the network to work correctly:

# Don't do this commented line
#emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
# Do this instead
emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]
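
As suggested above, one way to avoid hard-coding is to persist the embedding sizes at training time and read them back at inference time. A minimal sketch, assuming a file name of emb_szs.json (my own choice):

import json

# In the training script, straight after computing emb_szs:
with open('emb_szs.json', 'w') as f:
    json.dump(emb_szs, f)

# In the inference script, instead of the hard-coded list:
with open('emb_szs.json') as f:
    emb_szs = [tuple(pair) for pair in json.load(f)]  # JSON stores tuples as lists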

Right, we have everything in place: our categorical and continuous values are encoded and converted to PyTorch tensors, and our embeddings have been carried over from the training script so the shapes match. Please note – for your own dataset – this would need to be updated to match the shape of your categorical and continuous values.

Load and use our saved model

In the next steps we will load our saved model, with the same parameters we used for training, and then load the state_dict() from the model_artifacts folder (this folder could be called anything you like; I just called it model artifacts):

# Instantiate inference model
model_infer = MLPRegressor(emb_szs, conts.shape[1], 1, [200,100], p=0.4)
model_infer.load_state_dict(torch.load('model_artifacts/medical_insurance_400.pt'))
print(model_infer.eval())

The print of model_infer.eval() will show the original saved model structure:

Here you can see the importance of our embeddings matching, otherwise the model will throw a wobbly!

Define function to process our prod data

I will explain this function in more detail underneath the code:

def prod_data(model, cat_prod, cont_prod, verbose=False):
    # Pass the inputs from the cat and cont tensors to the model
    with torch.no_grad():
        y_val = model(cat_prod, cont_prod)
    # Get preds on prod data
    preds = []
    for i in range(len(cat_prod)):
        result = y_val[i].item()
        preds.append(result)
        if verbose == True:
            print(f'The predicted value is: {y_val[i].item()}')
    return preds

Let’s break this down:

  • model takes the loaded model whose state_dict() we restored
  • torch.no_grad() disables gradient calculation, as we only need inference
  • we set y_val to the model output, passing in our categorical and continuous PyTorch tensors
  • we then create an empty preds list to store our results
  • this is followed by a loop through all the production records, where we:
    • iterate over the length of the production tensor passed through the model
    • get each prediction’s .item() from the torch tensor
    • append each prediction to the preds list, incrementally until the loop finishes
    • use a boolean verbose parameter to indicate whether to print each result
  • after all this, the function returns just the preds list

Running the function we get the below print outs:

Some of these predictions are dubious, as the outliers and wide differences in medical insurance costs are clearly confusing the model. To make the model more useful, I would suggest creating a piecewise regression model to deal with the different bands, along with some treatment of the outliers.

The full code for the inference script is here:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
# Custom model imports
from models.Regression import MLPRegressor

data_name = 'medical_insurance'
# Read in the medical insurance data
df = pd.read_csv('https://raw.githubusercontent.com/StatsGary/Data/main/insurance_prod.csv')
# Drop nulls
df.dropna(axis='columns', inplace=True)
# Get number of rows
obs = len(df)

#=====================================================================================
# Feature Engineering
#=====================================================================================
# Encode the categorical features
cat_cols = ['sex', 'smoker', 'region', 'children']
cont_cols = ['age', 'bmi']
# Set the target (y) column
y = ['charges']
# CONVERT CATEGORICAL COLUMNS
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
cats = np.stack([df[col].cat.codes.values for col in cat_cols], 1)
cats = torch.tensor(cats, dtype=torch.int64)
# Convert continuous variables to a tensor
conts = np.stack([df[col].values for col in cont_cols], 1)
conts = torch.tensor(conts, dtype=torch.float)
# Create outcome
y = torch.tensor(df[y].values, dtype=torch.float).reshape(-1, 1)
# Set embedding sizes
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
#emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs = [(2, 1), (2, 1), (4, 2), (6, 3)]

# Instantiate inference model
model_infer = MLPRegressor(emb_szs, conts.shape[1], 1, [200,100], p=0.4)
model_infer.load_state_dict(torch.load('model_artifacts/medical_insurance_400.pt'))
print(model_infer.eval())

def prod_data(model, cat_prod, cont_prod, verbose=False):
    # Pass the inputs from the cat and cont tensors to the model
    with torch.no_grad():
        y_val = model(cat_prod, cont_prod)
    # Get preds on prod data
    preds = []
    for i in range(len(cat_prod)):
        result = y_val[i].item()
        preds.append(result)
        if verbose == True:
            print(f'The predicted value is: {y_val[i].item()}')
    return preds

# Use prod data function
prod = prod_data(model_infer, cats, conts, verbose=True)
# Print out prod
print(prod)

We have reached the end!

Wow – congratulations on getting this far. We have covered so much content in this tutorial.

Feel free to adapt the code and create a pull request to the GitHub repository if you want to add or adapt the code in any way. Remember – the aim of this tutorial was to show how you can create a regression network in PyTorch, and not to go way in depth in the billion ways you can encode features before modelling – that would be its own tutorial.

Learning PyTorch is harder than TensorFlow, as it is very pythonic and requires you to build classes; however, once you get used to it the tool is very powerful, and it is the library I mostly use in my natural language processing work at my company.

You have done well and keep on coding!
