Backpropagation is a key algorithm used in training fully connected neural networks, a common type of feed-forward network. In this algorithm, the network's output error is propagated backward, layer by layer, to adjust the weights of the connections between neurons.
The process starts by comparing the network’s output to the desired output, calculating the error. Then, starting from the output layer and moving backward, the algorithm computes the gradients of the error with respect to each weight in the network using the chain rule of calculus. These gradients indicate how much each weight contributes to the error.
Next, the weights are updated using gradient descent: each weight is adjusted in the direction that reduces the error, by an amount proportional to its gradient and a predefined learning rate, moving the network toward a solution. Backpropagation continues iteratively over the training data until the network's performance reaches a satisfactory level or a predetermined number of iterations is reached.
Overall, backpropagation efficiently adjusts the weights of a fully connected network, enabling it to learn complex relationships between input and output data through iterative optimization of the network’s parameters.
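Concretely, the update at the heart of this procedure is plain gradient descent. As a toy, self-contained illustration (the quadratic f(w) = (w - 3)^2, the starting point and the learning rate below are purely illustrative, not part of any network), repeatedly stepping against the gradient drives the parameter toward the minimum:

```python
# Toy gradient descent: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0          # initial parameter value
alpha = 0.1      # learning rate
for step in range(50):
    grad = 2 * (w - 3)     # gradient of the error at the current w
    w = w - alpha * grad   # update proportional to gradient and learning rate
print(round(w, 4))         # approaches the minimizer, w = 3
```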
In what follows, we walk through the mathematics and numpy code required to train a 2-layer fully connected network for a classification task.
Forward Pass
In the following, superscripts represent the layer associated with each variable:
- $X$: Input data having dimension n-by-f, where n is the number of samples and f the number of features. For a batch of 32 MNIST samples, $X$ would have dimension (32, 784).
- $y$: Target variable. When classifying a single digit from MNIST, a vector populated with 0s and 1s indicating the ground truth label for each sample (8 or not 8). Has the same length as the first dimension of $X$.
- $W^{[l]}$: Trainable weights. Projects the previous layer's activations to a lower-dimensional representation. Again referring to the first set of weights for a batch of 32 MNIST samples, $W^{[1]}$'s first dimension will match the second dimension of the activations from the previous layer (784), and $W^{[1]}$'s second dimension will be some lower dimension, say 256. $W^{[1]}$ will therefore have dimension (784, 256).
- $b^{[l]}$: Bias term, a one-dimensional vector associated with each hidden layer having length equal to the second dimension of that layer. $b^{[1]}$ will have dimension (256,).
- $Z^{[l]}$: Output of layer $l$, the matrix product of the previous layer's activations and the current layer's weights, plus the bias term.
- $A^{[l]}$: Activations associated with layer $l$, obtained by passing $Z^{[l]}$ through a non-linearity such as sigmoid or ReLU.
More concretely, assume a 2-layer fully-connected neural network with one hidden layer of size 256, through which a dataset of dimension 32-by-784 is passed to predict whether each of the 32 images is an 8 or not. The forward pass looks like:
- Randomly initialize $W^{[1]}$ (784×256), $b^{[1]}$ (256×1), $W^{[2]}$ (256×1) and $b^{[2]}$ (1×1).
- $X$ (32×784)
- $Z^{[1]} = XW^{[1]} + b^{[1]}$ (32×256)
- $A^{[1]} = \sigma(Z^{[1]})$ (32×256)
- $Z^{[2]} = A^{[1]}W^{[2]} + b^{[2]}$ (32×1)
- $A^{[2]} = \sigma(Z^{[2]})$ (32×1)

The final output, $A^{[2]}$, holds the predicted probability that each of the 32 images is an 8. With the actual labels $y$, the error is quantified using the binary cross-entropy loss:

$$L = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log A^{[2]}_i + (1 - y_i)\log\big(1 - A^{[2]}_i\big)\Big].$$
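Before deriving the gradients, it can help to see these shapes in numpy. A minimal sketch of the forward pass just described (the random data and the 0.01 weight scale are stand-ins for real MNIST inputs and a proper initialization):

```python
import numpy as np

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

X = np.random.rand(32, 784)                       # batch of 32 flattened images
W1, b1 = 0.01 * np.random.randn(784, 256), np.zeros(256)
W2, b2 = 0.01 * np.random.randn(256, 1), np.zeros(1)

Z1 = X @ W1 + b1          # (32, 256)
A1 = sigmoid(Z1)          # (32, 256)
Z2 = A1 @ W2 + b2         # (32, 1)
A2 = sigmoid(Z2)          # (32, 1), predicted probability of "is an 8"
print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```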
Backward Pass (Backpropagation)
The goal of backpropagation is to compute the partial derivatives of the loss function $L$ with respect to each of the trainable parameters: $\frac{\partial L}{\partial W^{[2]}}$, $\frac{\partial L}{\partial b^{[2]}}$, $\frac{\partial L}{\partial W^{[1]}}$ and $\frac{\partial L}{\partial b^{[1]}}$.

Once we have these gradients, each parameter is updated via gradient descent:

$$W^{[l]} \leftarrow W^{[l]} - \alpha\,\frac{\partial L}{\partial W^{[l]}}, \qquad b^{[l]} \leftarrow b^{[l]} - \alpha\,\frac{\partial L}{\partial b^{[l]}},$$

for some learning rate $\alpha$.
Let's start with unpacking $\frac{\partial L}{\partial W^{[2]}}$. By the chain rule:

$$\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial W^{[2]}}$$

The first term on the r.h.s. comes from differentiating the loss with respect to the output activation (working with the per-sample contribution and deferring the $\frac{1}{n}$ averaging to the end):

$$\frac{\partial L}{\partial A^{[2]}} = -\frac{y}{A^{[2]}} + \frac{1 - y}{1 - A^{[2]}}$$

The second term on the r.h.s., $\frac{\partial A^{[2]}}{\partial Z^{[2]}}$, is the derivative of the sigmoid:

$$\frac{\partial A^{[2]}}{\partial Z^{[2]}} = \sigma(Z^{[2]})\big(1 - \sigma(Z^{[2]})\big) = A^{[2]}\big(1 - A^{[2]}\big),$$

therefore

$$\frac{\partial L}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}} = \left(-\frac{y}{A^{[2]}} + \frac{1 - y}{1 - A^{[2]}}\right)A^{[2]}\big(1 - A^{[2]}\big) = A^{[2]} - y.$$

For the third term on the r.h.s., recall that $Z^{[2]} = A^{[1]}W^{[2]} + b^{[2]}$, so

$$\frac{\partial Z^{[2]}}{\partial W^{[2]}} = A^{[1]}.$$

Finally, we have, after averaging over the $n$ samples in the batch and arranging dimensions to match $W^{[2]}$:

$$\frac{\partial L}{\partial W^{[2]}} = \frac{1}{n}\,\big(A^{[1]}\big)^{T}\big(A^{[2]} - y\big).$$

As a notational convenience, we define

$$\delta^{[2]} \equiv \frac{\partial L}{\partial Z^{[2]}} = A^{[2]} - y.$$

This way,

$$\frac{\partial L}{\partial W^{[2]}} = \frac{1}{n}\,\big(A^{[1]}\big)^{T}\delta^{[2]}.$$
We proceed in a similar fashion for $\frac{\partial L}{\partial b^{[2]}}$:

$$\frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial b^{[2]}} = \frac{1}{n}\sum_{i=1}^{n}\delta^{[2]}_{i},$$

since $Z^{[2]} = A^{[1]}W^{[2]} + b^{[2]}$ implies $\frac{\partial Z^{[2]}}{\partial b^{[2]}} = 1$, leaving only the sum of $\delta^{[2]}$ over the batch.
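Continuing the numpy sketch from the forward-pass section (and assuming a random 0/1 label vector y as a stand-in for the true labels), the two layer-2 gradients derived above translate directly:

```python
# Layer-2 gradients: delta2 = A2 - y, dW2 = (1/n) A1^T delta2, db2 = (1/n) sum_i delta2_i.
y = np.random.randint(0, 2, size=(32, 1))   # stand-in 0/1 labels, shaped to match A2

n = X.shape[0]
delta2 = A2 - y                  # (32, 1)
dW2 = (A1.T @ delta2) / n        # (256, 1), matches W2's shape
db2 = delta2.sum(axis=0) / n     # (1,),    matches b2's shape
print(dW2.shape, db2.shape)
```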
For the first layer we re-use many of these calculations, but for the new terms on the r.h.s. we employ the chain rule in the same way. For reference, we restate the terms from the forward pass:

$$Z^{[1]} = XW^{[1]} + b^{[1]}, \qquad A^{[1]} = \sigma(Z^{[1]}), \qquad Z^{[2]} = A^{[1]}W^{[2]} + b^{[2]}, \qquad A^{[2]} = \sigma(Z^{[2]}).$$
We next consider $\frac{\partial L}{\partial W^{[1]}}$:

$$\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial A^{[1]}}\cdot\frac{\partial A^{[1]}}{\partial Z^{[1]}}\cdot\frac{\partial Z^{[1]}}{\partial W^{[1]}}$$

Considering each term on the r.h.s:

- $\frac{\partial L}{\partial Z^{[2]}} = \delta^{[2]}$ (computed above)
- $\frac{\partial Z^{[2]}}{\partial A^{[1]}} = W^{[2]}$
- $\frac{\partial A^{[1]}}{\partial Z^{[1]}} = \sigma'(Z^{[1]}) = A^{[1]}\odot\big(1 - A^{[1]}\big)$
- $\frac{\partial Z^{[1]}}{\partial W^{[1]}} = X$

Resulting in:

$$\frac{\partial L}{\partial W^{[1]}} = \frac{1}{n}\,X^{T}\Big[\big(\delta^{[2]}(W^{[2]})^{T}\big)\odot \sigma'(Z^{[1]})\Big].$$

As before, we define

$$\delta^{[1]} \equiv \frac{\partial L}{\partial Z^{[1]}} = \big(\delta^{[2]}(W^{[2]})^{T}\big)\odot \sigma'(Z^{[1]}),$$

which allows us to write

$$\frac{\partial L}{\partial W^{[1]}} = \frac{1}{n}\,X^{T}\delta^{[1]}.$$
Similarly for $\frac{\partial L}{\partial b^{[1]}}$:

$$\frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial Z^{[1]}}\cdot\frac{\partial Z^{[1]}}{\partial b^{[1]}}$$

Considering each term on the r.h.s:

- $\frac{\partial L}{\partial Z^{[1]}} = \delta^{[1]}$
- $\frac{\partial Z^{[1]}}{\partial b^{[1]}} = 1$

Therefore

$$\frac{\partial L}{\partial b^{[1]}} = \frac{1}{n}\sum_{i=1}^{n}\delta^{[1]}_{i}.$$
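And the layer-1 gradients, continuing from the same numpy sketch:

```python
# Layer-1 gradients: delta1 = (delta2 W2^T) * sigma'(Z1), dW1 = (1/n) X^T delta1.
sigmoid_prime = A1 * (1 - A1)               # elementwise sigma'(Z1), (32, 256)

delta1 = (delta2 @ W2.T) * sigmoid_prime    # (32, 256)
dW1 = (X.T @ delta1) / n                    # (784, 256), matches W1's shape
db1 = delta1.sum(axis=0) / n                # (256,),    matches b1's shape
print(dW1.shape, db1.shape)
```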
To complete the backpropagation algorithm, it is necessary to define the sigmoid activation, its derivative, the softmax function, the cross-entropy loss and an accuracy metric. Note that the code below generalizes the output layer to all 10 MNIST classes, replacing the sigmoid output with a softmax and the binary cross-entropy with the multi-class cross-entropy loss; conveniently, the gradient of the cross-entropy loss with respect to the softmax pre-activations takes the same form, $\delta^{[2]} = A^{[2]} - Y$ (with $Y$ the one-hot encoded labels), so the structure of the backward pass is unchanged. Assume the following helper functions:
```python
import numpy as np


def sigmoid(X):
    """Compute the sigmoid activation for the input."""
    return 1 / (1 + np.exp(-X))


def sigmoid_dev(X):
    """The analytical derivative of the sigmoid function at X."""
    return sigmoid(X) * (1 - sigmoid(X))


def softmax(scores):
    """
    Compute softmax scores given the raw output from the model.
    Returns softmax probabilities (N, num_classes).
    """
    numer = np.exp(scores - scores.max(axis=1, keepdims=True))
    denom = numer.sum(axis=1, keepdims=True)
    return np.divide(numer, denom)


def cross_entropy_loss(ypred, yactual):
    """Compute cross-entropy loss based on the network's predictions and labels."""
    yactual = np.asarray(yactual)
    ypred = ypred[np.arange(len(yactual)), yactual]
    return -np.mean(np.log(ypred))


def compute_accuracy(ypred, yactual):
    """Compute the accuracy of the current batch."""
    yactual = np.asarray(yactual)
    yhat = np.argmax(ypred, axis=1)
    return (yactual == yhat).sum() / yactual.shape[0]
```
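As a quick sanity check on the analytical derivative used in the backward pass, sigmoid_dev can be compared against a centered finite-difference approximation (a small aside; the test points and step size below are arbitrary):

```python
# Compare the analytical sigmoid derivative to a centered finite difference.
x = np.linspace(-3, 3, 7)
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(sigmoid_dev(x), numeric))  # expect True
```

With the helpers in place, a single forward and backward pass over a 32-image batch looks like: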
```python
# Stand-in for a batch of 32 MNIST images.
X = np.random.randint(0, 256, size=(32, 784))
y = np.random.randint(0, 10, size=32)

# Reshape labels to 32 x 10 (one-hot encode).
Y = np.zeros((32, 10))
Y[np.arange(X.shape[0]), y] = 1            # (32, 10)

# Learning rate.
alpha = 0.05

# Initialize weights.
b1 = np.zeros(256)
b2 = np.zeros(10)
W1 = 0.001 * np.random.randn(784, 256)
W2 = 0.001 * np.random.randn(256, 10)

# Forward pass.
Z1 = X @ W1 + b1                           # (32, 256)
A1 = sigmoid(Z1)                           # (32, 256)
Z2 = A1 @ W2 + b2                          # (32, 10)
A2 = softmax(Z2)                           # (32, 10)

# Compute loss and accuracy.
loss = cross_entropy_loss(A2, y)
accuracy = compute_accuracy(A2, y)

# Backward pass.
dZ2 = A2 - Y                               # (32, 10)
dW2 = (A1.T @ dZ2) / 32                    # (256, 10)
db2 = np.sum(dZ2, axis=0) / 32             # (10,)
dA1 = dZ2 @ W2.T                           # (32, 256)
dZ1 = np.multiply(dA1, sigmoid_dev(Z1))    # (32, 256)
dW1 = (X.T @ dZ1) / 32                     # (784, 256)
db1 = np.sum(dZ1, axis=0) / 32             # (256,)

# Update weights.
W2 = W2 - alpha * dW2
b2 = b2 - alpha * db2
W1 = W1 - alpha * dW1
b1 = b1 - alpha * db1
```
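The analytical gradients can also be spot-checked numerically. The sketch below (which assumes a small helper, loss_at, defined just for this check) perturbs a single entry of $W^{[2]}$ and compares the resulting centered finite-difference estimate to the corresponding entry of dW2, recomputed at the current weights:

```python
def loss_at(W1, b1, W2, b2):
    """Recompute the forward pass and loss for the given parameters."""
    A1 = sigmoid(X @ W1 + b1)
    A2 = softmax(A1 @ W2 + b2)
    return cross_entropy_loss(A2, y)

# Analytical gradient of W2 at the current (post-update) parameters.
A1 = sigmoid(X @ W1 + b1)
A2 = softmax(A1 @ W2 + b2)
dW2 = (A1.T @ (A2 - Y)) / 32

# Centered finite-difference estimate for one entry of W2.
eps = 1e-5
i, j = 0, 0
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[i, j] += eps
W2_minus[i, j] -= eps
numeric = (loss_at(W1, b1, W2_plus, b2) - loss_at(W1, b1, W2_minus, b2)) / (2 * eps)
print(np.isclose(dW2[i, j], numeric))  # expect True
```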
The code starting with the forward pass would be iterated over a set of batches for a pre-determined number of epochs. The final weights would then be used for inference.
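One way that loop might look in practice, reusing the helper functions, learning rate and weights defined above (a sketch only; the random stand-in data, batch size and epoch count are illustrative, and with random labels the loss will not meaningfully decrease):

```python
# Sketch of a batch/epoch training loop around the forward and backward pass.
n_epochs, batch_size = 5, 32
X_all = np.random.randint(0, 256, size=(256, 784))   # stand-in dataset
y_all = np.random.randint(0, 10, size=256)

for epoch in range(n_epochs):
    epoch_loss = []
    for start in range(0, X_all.shape[0], batch_size):
        X_b = X_all[start:start + batch_size]
        y_b = y_all[start:start + batch_size]
        n = X_b.shape[0]
        Y_b = np.zeros((n, 10))
        Y_b[np.arange(n), y_b] = 1

        # Forward pass.
        Z1 = X_b @ W1 + b1
        A1 = sigmoid(Z1)
        Z2 = A1 @ W2 + b2
        A2 = softmax(Z2)
        epoch_loss.append(cross_entropy_loss(A2, y_b))

        # Backward pass.
        dZ2 = A2 - Y_b
        dW2 = (A1.T @ dZ2) / n
        db2 = np.sum(dZ2, axis=0) / n
        dZ1 = (dZ2 @ W2.T) * sigmoid_dev(Z1)
        dW1 = (X_b.T @ dZ1) / n
        db1 = np.sum(dZ1, axis=0) / n

        # Gradient descent update.
        W1 = W1 - alpha * dW1
        b1 = b1 - alpha * db1
        W2 = W2 - alpha * dW2
        b2 = b2 - alpha * db2

    print(f"epoch {epoch}: mean loss {np.mean(epoch_loss):.4f}")
```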