Python-bloggers

Building a PyTorch binary classification multi-layer perceptron from the ground up

This article was first published on Python – Hutsons-hacks , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

This assumes you know how to programme in Python and know a little about n-dimensional arrays and how to work with them in numpy (don’t worry if you don’t I got you covered).

PyTorch is a pythonic way of building Deep Learning neural networks from scratch. This is something I have been learning over the last 2 years, as historically my go to deep learning framework was Tensorflow.

For beginners to PyTorch it can be daunting to first work with the application as it forces you in the direction of building Python classes, inheritance and tensor and array programming. However, once you start to work with it you start to appreciate the power of PyTorch and how much control it gives you on the creation process of deep neural networks. I have since implemented these frameworks for Natrual Language Processing, Computer Vision, Transformers and audio classification. In this tutorial we will build up a MLP from the ground up and I will teach you what each step of my network is doing.

If you are ready – then let’s dive in! Open your mind and prepare to explore the wonderful and strange world of PyTorch.

The dataset

With this tutorial we will use a dataset from the MLDataR project on classifying whether a person has thyroid disease or not. This uses a number of variables to indicate if thyroid disease is present.

Before we start – where is the code?

There are two flavours:

The packages

We are going to need a number of different functions and packages. The first stage is to get all these imported, if they are not installed, then do not concern yourself as I have a requirements.txt file on hand to help you in the supporting GitHub repository:

from numpy import vstack
from pandas import read_csv
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, average_precision_score
from sklearn.metrics import confusion_matrix, recall_score, f1_score
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
import torch
from torch import Tensor
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.optim import lr_scheduler
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_
import time
import copy

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Here we are using numpy (for array processing), pandas (for working with data frames and series), sklearn for encoding of labels and building up model metrics, torch utilities for working with the input data, torch tensors and other elements of our MLP stack and the time module for timing how long our training loops take.

The data

The data were are going to be using for this PyTorch tutorial has already been preprocessed and consists of all the fields where I have stripped off the row headers.

This is linked to the thyroid data discussed and contains:

The link to the dataset is: https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv.

The first stage of the process is to take the data and create a PyTorch readable data object. These are called data loaders and tell PyTorch how to work with the data. This will be defined in the next steps, but to read more about data loaders, see the official tutorial site: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html.

Step one – Building our first PyTorch component – DataLoaders

This is where things get interesting and we will give chunk by chunk into what is happening under the hood.

Creating the data loader to pull in CSV files

Firstly we need to create a dataset class with one input Dataset – this is a specific PyTorch module that works with various types of data. Because we have tabular data, we will need to declare a reader to read in the file from the link above (the raw data stored on GitHub) and then we will do some conversions:

class ThryoidCSVDataset(Dataset):
    #Constructor for initially loading
    def __init__(self,path):
        df = read_csv(path, header=None)
        # Store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1] #Assuming your outcome variable is in the first column
        self.X = self.X.astype('float32')
        # Label encode the target as values 1 and 0 or sick and not sick
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

So there are a number of things to explain here:

Magic method (dunder) in our class

Still within the class, we build two more class functions. These are what are known as dunder (magic) methods that every class has in Python:

    def __init__(self,path):
        # Get the number of rows in the dataset
    def __len__(self):
        return len(self.X)
    # Get a row at an index
    def __getitem__(self,idx):
        return [self.X[idx], self.y[idx]]

Here we have overridden the length method to provide the number of rows in our training set and we have used the __getitem__ method to pull the relevant index position (idx) of the relevant item. This will allow us to slice and retrieve elements.

Building our custom method to split our data into train and test splits

From looking at this code do you notice that when we declare self in the function parameter block we always declare it first! Also, tip, it does not have to be self you can call it myself, gary, colin, so long as it is consistent. However, it is standard practice to use the self naming convention.

     def split_data(self, split_ratio=0.2):
        test_size = round(split_ratio * len(self.X))
        train_size = len(self.X) - test_size
        return random_split(self, [train_size, test_size])

People familiar with SKLearn will be aware of the train_test_split function. Here I have replicated the same functionality. This is what the function does:

If you made it this far then well done – we have covered quite a bit of ground already. In saying that, we have only really told PyTorch how to read the data. What we do next is tell the network how we want to model the data.

The full implementation of the class we just defined, is below:

# Create a custom CSVDataset loader
class ThyroidCSVDataset(Dataset):
    #Constructor for initially loading
    def __init__(self,path):
        df = read_csv(path, header=None)
        # Store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1] #Assuming your outcome variable is in the first column
        self.X = self.X.astype('float32')
        # Label encode the target as values 1 and 0 or sick and not sick
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

    # Get the number of rows in the dataset
    def __len__(self):
        return len(self.X)
    # Get a row at an index
    def __getitem__(self,idx):
        return [self.X[idx], self.y[idx]]

    # Create custom class method - instead of dunder methods
    def split_data(self, split_ratio=0.2):
        test_size = round(split_ratio * len(self.X))
        train_size = len(self.X) - test_size
        return random_split(self, [train_size, test_size])

Step two – defining our multi-layer perceptron ANN

Still with me? If you are then thanks, if not come back after a hot coffee, or tea (so British of me!).

A wee bit of theory

We are going to define a multilayer perceptron class, and we will implement a forward message passing layer, we call this a feed forward structure:

The idea is that we feed our inputs – in this case of independent variables, we then build what are known as hidden layers these layers pass values between them and can start to work out patterns in the data, that as mere mortals we would struggle to analyse. The more layers we expose, the longer the network takes to train due to increase complexity and number of connections. Weights are fed at various stages through the network and then updated through something called back propagation which is normally monitored through an epoch. The network learns through how big an update is made to the weights across the network – this is called learning rate and the loss is optimised and minimised through each pass through the network. How the nodes are activated in the network is controlled by an activation function, which says which nodes to activate and which nodes not to. This network simulates how the human cognitive process learns.

I have reached the end of the theory part of this blog. Let’s get on with implementing it.

Implementing the layers in our network

Firstly we will declare our class and specify the number of inputs to the network:

# Create model
class ThyroidMLP(Module):
    def __init__(self, n_inputs):
        super(ThyroidMLP, self).__init__()
        # First hidden layer
        self.hidden1 = Linear(n_inputs, 20)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # Second hidden layer
        self.hidden2 = Linear(20, 10)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # Third hidden layer
        self.hidden3 = Linear(10,1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

Lots to explain here:

Create the forward passing mechanism

You will see this in a number of PyTorch scripts – the weights are passed forward through the network and then the updates to the weights i.e. optimization is done by back propogation. Let’s put the final piece of our network together.

def forward(self, X):
        #Input to the first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # Second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # Third hidden layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X

We use X to keep passing each layer to the next until we have all of our hidden layers and activation functions defined. We have now created our model class, that we will work with in the training loop.

The full implementation of this part of the class is below:

class ThyroidMLP(Module):
    def __init__(self, n_inputs):
        super(ThyroidMLP, self).__init__()
        # First hidden layer
        self.hidden1 = Linear(n_inputs, 20)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # Second hidden layer
        self.hidden2 = Linear(20, 10)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # Third hidden layer
        self.hidden3 = Linear(10,1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

    def forward(self, X):
        #Input to the first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # Second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # Third hidden layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X

The fun part is coming up, definition of our training loop.

Step three – the training loop

This is how the model will train and update.

These can be as complex or simple as you want to make them. I have tried to aim it in the middle, as overly simplified would not be useful, and to complicated might fry your brain at this stage, especially if you are a beginner coming at it.

# Create training loop based off our custom class
def train_model(train_dl, model, epochs=100, lr=0.01, momentum=0.9, save_path='thyroid_best_model.pth'):
    # Define your optimisation function for reducing loss when weights are calculated 
    # and propogated through the network
    start = time.time()
    criterion = BCELoss()
    optimizer = SGD(model.parameters(), lr=lr, momentum=momentum)
    loss = 0.0

    for epoch in range(epochs):
        print('Epoch {}/{}'.format(epoch+1, epochs))
        print('-' * 10)
        model.train()
        # Iterate through training data loader
        for i, (inputs, targets) in enumerate(train_dl):
            optimizer.zero_grad()
            outputs = model(inputs)
            _, preds = torch.max(outputs.data,1) #Get the class labels
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        torch.save(model, save_path)
    time_delta = time.time() - start
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_delta // 60, time_delta % 60
    ))
    
    return model

Here we go:

The next step is to create the loop to loop through each epoch and start to train the model:

Note: to use the GPU you would need to cast model.to(device) or model.to(“cuda”) for parallel processing.

There we go we have created the training loop. Next we need to decide how we are going to evaluate the model.

Step four – evaluating the performance of our network

The next function will be used to evaluate our PyTorch model to see if it is any good, or if we have been wasting our time for the last 20 minutes.

The function for this is contained hereunder, and as always, I will add my view of what is happening in each step:

import math
def evaluate_model(test_dl, model, beta=1.0):
    preds = []
    actuals = []

    for (i, (inputs, targets)) in enumerate(test_dl):
        #Evaluate the model on the test set
        yhat = model(inputs)
        #Retrieve a numpy weights array
        yhat = yhat.detach().numpy()
        # Extract the weights using detach to get the numerical values in an ndarray, instead of tensor
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # Round to get the class value i.e. sick vs not sick
        yhat = yhat.round()
        # Store the predictions in the empty lists initialised at the start of the class
        preds.append(yhat)
        actuals.append(actual)
    
    # Stack the predictions and actual arrays vertically
    preds, actuals = vstack(preds), vstack(actuals)
    #Calculate metrics
    cm = confusion_matrix(actuals, preds)
    # Get descriptions of tp, tn, fp, fn
    tn, fp, fn, tp = cm.ravel()
    total = sum(cm.ravel())
    
    metrics = {
        'accuracy': accuracy_score(actuals, preds),
        'AU_ROC': roc_auc_score(actuals, preds),
        'f1_score': f1_score(actuals, preds),
        'average_precision_score': average_precision_score(actuals, preds),
        'f_beta': ((1+beta**2) * precision_score(actuals, preds) * recall_score(actuals, preds)) / (beta**2 * precision_score(actuals, preds) + recall_score(actuals, preds)),
        'matthews_correlation_coefficient': (tp*tn - fp*fn) / math.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn)),
        'precision': precision_score(actuals, preds),
        'recall': recall_score(actuals, preds),
        'true_positive_rate_TPR':recall_score(actuals, preds),
        'false_positive_rate_FPR':fp / (fp + tn) ,
        'false_discovery_rate': fp / (fp +tp),
        'false_negative_rate': fn / (fn + tp) ,
        'negative_predictive_value': tn / (tn+fn),
        'misclassification_error_rate': (fp+fn)/total ,
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        #'confusion_matrix': confusion_matrix(actuals, preds), 
        'TP': tp,
        'FP': fp, 
        'FN': fn, 
        'TN': tn
    }
    return metrics, preds, actuals
        

There is a lot to go over here:

Step five – creating the prediction routine

This routine is a relatively simple function to those we have compared above. This routine takes in the row (a new list of data) as well as the relevant model and returns a prediction from the model yhat. Finally, we return a detached numpy array:

def predict(row, model):
    row = Tensor([row])
    yhat = model(row)
    # Get numpy array
    yhat = yhat.detach().numpy()
    return yhat   

This will give use a prediction for each input we pass into the model. The next step is to prepare our data ready for working with the model.

Step six – prepare the data to use with our model

We are actually at the point where we will be using our custom model structure to run our model, but first we need an additional helper function to allow us to prepare our dataset:

  def prepare_thyroid_dataset(path):
    dataset = ThyroidCSVDataset(path)
    train, test = dataset.split_data(split_ratio=0.1)
    # Prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

This function takes the path of where the csv file is stored. In our case this is on GitHub: https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv. Then the following happens:

Using our custom classes to train the model

We have prepared all the groundwork needed to build out supervised machine learning classifier. So let’s step through and call the relevant functions.

Loading the dataset

We will fetch the thyroid dataset from the blob storage. This dataset is highly imbalanced and is a Kaggle classification project, so I would expect the model to do well in predicting the negative examples and not so well in picking up whether a patient is sick. You would need to have some imbalanced label strategies in your back pocket – such as SMOTE and ROSE, but these are beyond the scope of this tutorial.

Let’s load the data:

train_dl, test_dl = prepare_thyroid_dataset('https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv')

Training the model

To train the model we will pass one input – this is the number of independent variables, or predictor variables, to use with the model. I know that the thyroid dataset has 26, so this is the input we would choose. The only thing you would need to change is this value:

# Specify the number of input dimensions
model = ThyroidMLP(26)

To configure the training run we will use the train_model function we created:

train_model(train_dl, 
            model, 
            save_path='data/thyroid_model.pth', 
            epochs=150, 
            lr=0.01)

Here I pass:

You will see the epochs running and then you will get a print out of the DAG (Directed acyclic graph).

Evaluating how well the model performs with the test data

Previously we built an evaluation model function, which had a long dictionary of model results. We are going to pass our test dataloader to this to get the results:

results = evaluate_model(test_dl, model, beta=1)
model_metrics = results[0]
metrics_df = pd.DataFrame.from_dict(model_metrics, orient='index', columns=['metric'])
metrics_df.index.name = 'metric_type'
metrics_df.reset_index(inplace=True)
metrics_df.to_csv('confusion_matrix_thyroid.csv', index=False)

To decode this:

The results are below:

#                          metric_type      metric
# 0                           accuracy    0.727273
# 1                             AU_ROC    0.497436
# 2                           f1_score    0.096386
# 3            average_precision_score    0.235493
# 4                             f_beta    0.096386
# 5   matthews_correlation_coefficient   -0.008809
# 6                          precision    0.222222
# 7                             recall    0.061538
# 8             true_positive_rate_TPR    0.061538
# 9            false_positive_rate_FPR    0.066667
# 10              false_discovery_rate    0.777778
# 11               false_negative_rate    0.938462
# 12         negative_predictive_value    0.762646
# 13      misclassification_error_rate    0.272727
# 14                       sensitivity    0.061538
# 15                       specificity    0.933333
# 16                                TP   12.000000
# 17                                FP   42.000000
# 18                                FN  183.000000
# 19                                TN  588.000000

As suspected – we have a massively imbalanced datasets, so the MLP is struggling to produce a good classification for the true positives, as the absence of positive labels is apparent.

Let’s poke our model and see what prediction we get, but looking at these results it is most likely to output that the classification of thyroid disease is negative.

Make a prediction against the model

This would be the part where if you were happy you would push the model into production and then new unseen observations would be scored against the model. I will poke it with one observation to see how it performs.

row = [0.8408678952719717,0.7480132415430958,-0.3366221139379705,-0.0938130059640389,-0.1101874782051067,-0.2098160394213988,-0.1260114177378201,-0.1118651062104989,-0.1274917875477927,-0.240146053214037,-0.2574472174396955,-0.0715198539852151,-0.0855764265990022,-0.1493202733578882,-0.0190692517849118,-0.2590488060984638,0.0,-0.1753175780014474,0.0,-0.9782211033008232,0.0,-1.3237957945784953,0.0,-0.6384998731458282,0.0,-1.209042232192488]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

I have used a row in the dataframe where I know the patient is sick to test the label value of the model. I can see that my suspicions about the imbalance in the model are true:

Predicted: 0.499 (class=0)

I would never deploy this model. My next step would be to try some class rebalancing techniques.

Running the model against balanced dataset

For sake of time – I will use a dataset called Ionsphere that I know is well balanced and will show our model in a better light than this example. We will do the data prep and training in one cell and then poke the model. The ionsphere data is stored here:https://raw.githubusercontent.com/StatsGary/Data/main/ion.csv.

Prepare and train

I will add the ionsphere data in one code cell – this will show how to prepare the data, train the model, evaluate and predict against the model:

# Get the ionsphere data
train_dl, test_dl = prepare_thyroid_dataset('https://raw.githubusercontent.com/StatsGary/Data/main/ion.csv')
# Train the model
# Specify the number of input dimensions
model = ThyroidMLP(34)
# Train the model
train_model(train_dl, model, 
            save_path='data/ionsphere_model.pth', 
            epochs=150, 
            lr=0.01)

The only differences here is that I loaded a different dataset and specified the number of input dimensions differently.

Evaluate the model

I have written a little eval_model wrapper function here, as there are a couple of steps to converting the stored dictionary into a data frame:

# Evaluate the model
def eval_model(test_dl, model, cm_out_name='confusion_mat.csv',
               beta=1, export_index=False):
    results = evaluate_model(test_dl, model, beta)
    model_metrics = results[0]
    metrics_df = pd.DataFrame.from_dict(model_metrics, orient='index', columns=['metric'])
    metrics_df.index.name = 'metric_type'
    metrics_df.reset_index(inplace=True)
    metrics_df.to_csv(cm_out_name, index=export_index)
    print(metrics_df)
    return metrics_df, model_metrics, results

results = eval_model(test_dl, model)
print(results[0])

This returns the metrics data frame, the model metrics as their raw dictionary and the results of the evaluate_model function. Let’s see how our model performs:

#                          metric_type     metric
# 0                           accuracy   0.923810
# 1                             AU_ROC   0.882353
# 2                           f1_score   0.946667
# 3            average_precision_score   0.898734
# 4                             f_beta   0.946667
# 5   matthews_correlation_coefficient   0.829016
# 6                          precision   0.898734
# 7                             recall   1.000000
# 8             true_positive_rate_TPR   1.000000
# 9            false_positive_rate_FPR   0.235294
# 10              false_discovery_rate   0.101266
# 11               false_negative_rate   0.000000
# 12         negative_predictive_value   1.000000
# 13      misclassification_error_rate   0.076190
# 14                       sensitivity   1.000000
# 15                       specificity   0.764706
# 16                                TP  71.000000
# 17                                FP   8.000000
# 18                                FN   0.000000
# 19                                TN  26.000000

This model is much more balanced, and much more unrealistic than the first scenario presented, however it makes for good practice in implemeting the model with different datasets.

Predict with the model

The final step we will make a prediction with the model. I will choose a row that I know should be a positive class and let’s see how our model performs:

# Make prediction against model
row = [1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

Predicted: 0.992 (class=1)

This model does so much better at predicting the right class label.

That is that! You have done excellent!

That is it! You have reached the end of this tutorial. I hope by working through this you feel confident to implement your own PyTorch module, or just use this code for your projects.

Please reach out if you need any help adapting this code. I have really enjoyed putting this together and I continue to develop in this toolset, as I love the flexibility.

All there is left to say is:

To leave a comment for the author, please follow the link and comment on their blog: Python – Hutsons-hacks .

Want to share your content on python-bloggers? click here.
Exit mobile version