Gradient Boosting Multi-Class Classification from Scratch

Posted on October 14, 2023 by Matt Bowers in Data science | 0 Comments

This article was first published on Random Realizations , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Tell me dear reader, who among us, while gazing in wonder at the improbably verdant aloe vera clinging to the windswept rock at Cape Point near the southern tip of Africa, hasn’t wondered: how the heck do gradient boosting trees implement multi-class classification? Today, we’ll unravel this mystery by reviewing the theory and implementing the algorithm for ourselves in python. Specifically, we’ll review the multi-class gradient boosting model originally described in Friedman’s classic Greedy Function Approximation paper, and we’ll implement components of the algorithm as we go along. Once we have all the pieces, we’ll write a python class for multi-class gradient boosting with a similar API to the scikit-learn GradientBoostingClassifier.

If you need a refresher on gradient boosting before diving in here, then start with my original gradient boosting from scratch post, which is the first installment in my ongoing series on gradient boosting.

The multi-class gradient boosting classification algorithm

Friedman describes the algorithm for training a multi-class classification gradient boosting model in Algorithm 6 of the classic Greedy Function Approximation paper. If you want a step-by-step walkthrough of the ideas in the paper, have a look at my post on the generalized gradient boosting algorithm. In high-level terms, the algorithm for multi-class gradient boosting is:

Set the initial model predictions.
Repeat the following for each boosting round.
Repeat the following for each class.
Compute the pseudo residuals.
Train a regression tree to predict the pseudo residuals.
Adjust the tree’s predicted values to optimize the objective function.
Add the new tree to the current composite model.

Let’s take a look at the details for each of these steps.

Target variable encoding

Following the convention in scikit-learn, when training a multi-class classifier, the target variable in the training dataset should be integer encoded so that the $K$ distinct classes are mapped to the integers $0,1,\dots,K-1$ . In the code for model training, however, it’s going to be more convenient to work with a one hot encoded representation of the target. Therefore we’ll start by writing an internal method to transform the target variable from integer encoding to one hot encoding. Remember that eventually we’ll write a class for our multi-class gradient boosting model, so I’ll write this function like a class method with a leading argument called self.

import numpy as np
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

def _one_hot_encode_labels(self, y):
    if isinstance(y, pd.Series): y = y.values
    ohe = OneHotEncoder()
    y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
    return y_ohe

This code takes the integer-encoded target variable, makes sure it’s a numpy array, then uses cikit-learn’s one hot encoder to encode it as a 2D array with observations along the first axis and classes along the second axis. I tend to think of the one hot encoded output as a matrix with $n$ rows (the number of observations in the training data) and $K$ columns (the number of classes), although it’s technically not a matrix but rather a 2D array.

Model predictions in raw space and probability space

In amulti-class classification problem with $K$ classes, the model prediction for a particular observation returns a list of $K$ probabilities, one for each class. Essentially the model prediction is a conditional probability mass function for the discrete target variable, conditioned on the feature values.

So, we need a way to ensure that the model output is a valid probability mass function, i.e. each probability is in (0, 1) and the $K$ class probabilities sum to 1. Analogous to logistic regression, we can accomplish this by using the model to first make a raw prediction which can be any real number, then using something like the inverse logit function to transform the raw model prediction into a number between 0 and 1 that can be interpreted as a probability. Again analogous to logistic regression, in the multi-class setting we use $K$ different models, one for each class, to generate the raw predictions, then we transform the raw model predictions into probabilities using the softmax function,, which takes a length- $K$ vector of real numbers as input and returns a probability mass function over $K$ discrete classes.

Let $\{F_1(\mathbf{x}),\dots,F_K(\mathbf{x})\}=\{F_k(\mathbf{x})\}_1^K$ be the list of $K$ raw model outputs, and let $\{p_1(\mathbf{x}),\dots,p_K(\mathbf{x})\}=\{p_k(\mathbf{x})\}_1^K$ be the corresponding probability mass function over the $K$ classes, then the softmax function is defined as

$p_k(\mathbf{x}) = \text{softmax}_k(\{F_k(\mathbf{x})\}_1^K) = \frac{e^{F_k(\mathbf{x})}}{\sum_{l=1}^K e^{F_l(\mathbf{x})}}$

Let's implement an internal softmax method that transforms the raw predictions into probabilities.

def _softmax(self, raw_predictions):
    numerator = np.exp(raw_predictions) 
    denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
    return numerator / denominator

Initial model predictions

We’re now ready to implement model training, starting with line 1 of the algorithm which sets the initial model predictions. In our code, we’ll keep the raw model predictions $\{F_k(\mathbf{x})\}_1^K$ for the $n$ observations in the training dataset in a size $n \times K$ array called raw_predictions, and we’ll keep the corresponding probabilities $\{p_k(\mathbf{x})\}_1^K$ in another $n \times K$ array called probabilities. Perhaps the simplest reasonable initialization is to set the probabilities to $1/K$ , i.e. $p_k(\mathbf{x})=1/K$ , which implies $F_k(\mathbf{x})=0$ .

We’ll go ahead and create that one hot encoded representation of the target, then use it to set the right size for the model prediction arrays.

y_ohe = self._one_hot_encode_labels(y)
raw_predictions = np.zeros(shape=y_ohe.shape)
probabilities = self._softmax(raw_predictions)

Boosting

Line 2 of the algorithm kicks off a loop to iteratively perform boosting rounds. Within each round, line 3 specifies that we iterate through each of the $K$ classes, adding a new booster model for each class at each boosting round. We’ll keep all the boosters in a list called boosters, where each element is itself a list which we’ll call class_trees that contains the $K$ trees we trained in a given boosting round. For each round and each class, we compute the pseudo residuals (negative gradients), train a decision tree to predict them, update the tree’s predicted values to optimize the overall objective function, then update the current raw and probability predictions before storing the new tree in that round’s list of class trees.

self.boosters = []
for m in range(self.n_estimators):
    class_trees = []
    for k in range(self.n_classes):
        negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
        hessians = self._hessians(probabilities[:, k])
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(X, negative_gradients);
        self._update_terminal_nodes(tree, X, negative_gradients, hessians)
        raw_predictions[:, k] += self.learning_rate * tree.predict(X)
        probabilities = self._softmax(raw_predictions)
        class_trees.append(tree)
    self.boosters.append(class_trees)

Next we’ll dive into the details of the pseudo residual computation and the adjustment to the tree booster predicted values.

Pseudo Residuals

For each observation in the training dataset, the pseudo residual is the negative gradient of the objective function with respect to the corresponding model prediction. The objective function for multi-class classification is the Multinomial Negative Log Likelihood. For a single observation, the objective is

$J(\{ y_k, p_k(\mathbf{x}) \}_1^K) = -\sum_{k=1}^K y_k \log p_k(\mathbf{x})$

We can rewrite the objective in terms of our raw model output $F$ like this.

$J(\{ y_k, F_k(\mathbf{x}) \}_1^K) = -\sum_{k=1}^K y_k \log \frac{e^{F_k(\mathbf{x})}}{\sum_{l=1}^K e^{F_l(\mathbf{x})}}$

The negative gradient of the objective with respect to raw model prediction $F_k(\mathbf{x}_i)$ for training example $i$ is given by

$r_{ik} = -J'(F_k(\mathbf{x}_i)) = -\left[ \frac{\partial J(\{ y_{il}, F_l(\mathbf{x_i})\}_{l=1}^K)}{\partial F_k(\mathbf{x}_i) } \right] =y_{ik} - p_{k}(\mathbf{x}_i)$

You can take a look at the derivation if you’re curious how to work it out yourself. Note that this formula has a nice intuition. When $y_{ik}=1$ , if predicted probability $p_k(\mathbf{x}_i)$ is terrible and close to 0, then the pseudo residual will be positive, and the next boosting round will try to increase the predicted probability. Otherwise if the predicted probability is already good and close to 1, the pseudo residual will be close to 0 and the next boosting round won’t change the predicted probability very much.

We can easily implement an internal method to compute the negative gradients over the training dataset as follows.

def _negative_gradients(self, y_ohe, probabilities):
    return y_ohe - probabilities

Adjusting the trees’ predicted values

After training a regression tree to predict the pseudo residuals, we need to adjust the predicted values in its terminal nodes to optimize the overall objective function. In the Greedy Function Approximation paper, Friedman actually specifies finding the optimal value using a numerical optimization routine like line search. We could express that like

$v = \text{argmin}_v \sum_{i \in t} J(y_{i}, F(\mathbf{x}_i) + v)$

where $t$ is the set of samples falling into this terminal node.

In the scikit-learn implementation of gradient boosting classification, the authors instead use the approach from FHT00 which uses a single Newton descent step to approximate the optimal predicted value for each terminal node. See code and comments for the function _update_terminal_regions in the scikit-learn gradient boosting module. The updated value is computed like

$v = -\frac{\sum_{i \in t} J'(F(\mathbf{x}_i))}{\sum_{i \in t} J''(F(\mathbf{x}_i))}$

We already found the first derivative of the objective, so we just need to calculate the second derivative.

$J''(F_k(\mathbf{x}_i)) = \left[ \frac{\partial J(\{ y_{il}, F_l(\mathbf{x_i})\}_{l=1}^K)}{\partial ^2 F_k(\mathbf{x}_i) } \right] = p_k(\mathbf{x}_i) (1 - p_k(\mathbf{x}_i))$

Here’s the internal method to compute the second derivative .

def _hessians(self, probabilities): 
    return probabilities * (1 - probabilities)

Then we can implement the internal method for updating the tree predicted values. I give more details about how to manually set scikit-learn’s decision tree predicted values in the post on gradient boosting with any loss function.

def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
    '''Update the terminal node predicted values'''
    # terminal node id's
    leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
    # compute leaf for each sample in ``X``.
    leaf_node_for_each_sample = tree.apply(X)
    for leaf in leaf_nodes:
        samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
        negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
        hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
        val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
        tree.tree_.value[leaf, 0, 0] = val

Prediction

At inference time, the user supplies an X with multiple observations of the feature variables, and our model needs to issue a prediction for each observation. We’ll start by implementing the predict_proba method, which takes X as input and returns a length- $K$ probability mass function for each observation in X. To do this, we’ll initialize the raw predictions with zeros, just as we did in training, and then for each class, we’ll loop through all the boosters, collecting their predictions on X, scaling by the learning rate, and summing them up. Finally, we use the softmax to transform raw predictions into the probabilities.

def predict_proba(self, X):
    '''Generate probability predictions for the given input data.'''
    raw_predictions =  np.zeros(shape=(X.shape[0], self.n_classes))
    for k in range(self.n_classes):
        for booster in self.boosters:
            raw_predictions[:, k] +=self.learning_rate * booster[k].predict(X)
    probabilities = self._softmax(raw_predictions)
    return probabilities

Then to get the predicted labels, we can use the predict_proba method to generate probabilities, simply returning the integer-encoded class label of the largest probability for each observation in X.

def predict(self, X):
    '''Generate predicted labels (as integer-encoded array)'''
    probabilities = self.predict_proba(X)
    return np.argmax(probabilities, axis=1)

The complete multi-class gradient boosting classification model implementation

Now we’re ready to implement a multi-class classification gradient boosting model class with public fit, predict_proba, and predict methods. We combine the components above into a fit method for model training, and we add the two prediction methods to complete the model’s functionality.

import numpy as np
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor 
from sklearn.preprocessing import OneHotEncoder

class GradientBoostingClassifierFromScratch():
    '''Gradient Boosting Classifier from Scratch.
    
    Parameters
    ----------
    n_estimators : int
        number of boosting rounds
        
    learning_rate : float
        learning rate hyperparameter
        
    max_depth : int
        maximum tree depth
    '''
    
    def __init__(self, n_estimators, learning_rate=0.1, max_depth=1):
        self.n_estimators=n_estimators; 
        self.learning_rate=learning_rate
        self.max_depth=max_depth;
    
    def fit(self, X, y):
        '''Fit the GBM
        
        Parameters
        ----------
        X : ndarray of size (number observations, number features)
            design matrix
            
        y : ndarray of size (number observations,)
            integer-encoded target labels in {0,1,...,k-1}
        '''
        
        self.n_classes = pd.Series(y).nunique()
        y_ohe = self._one_hot_encode_labels(y)

        raw_predictions = np.zeros(shape=y_ohe.shape)
        probabilities = self._softmax(raw_predictions)
        self.boosters = []
        for m in range(self.n_estimators):
            class_trees = []
            for k in range(self.n_classes):
                negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
                hessians = self._hessians(probabilities[:, k])
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, negative_gradients);
                self._update_terminal_nodes(tree, X, negative_gradients, hessians)
                raw_predictions[:, k] += self.learning_rate * tree.predict(X)
                probabilities = self._softmax(raw_predictions)
                class_trees.append(tree)
            self.boosters.append(class_trees)
    
    def _one_hot_encode_labels(self, y):
        if isinstance(y, pd.Series): y = y.values
        ohe = OneHotEncoder()
        y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
        return y_ohe
        
    def _negative_gradients(self, y_ohe, probabilities):
        return y_ohe - probabilities
    
    def _hessians(self, probabilities): 
        return probabilities * (1 - probabilities)

    def _softmax(self, raw_predictions):
        numerator = np.exp(raw_predictions) 
        denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
        return numerator / denominator
        
    def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
        '''Update the terminal node predicted values'''
        # terminal node id's
        leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
        # compute leaf for each sample in ``X``.
        leaf_node_for_each_sample = tree.apply(X)
        for leaf in leaf_nodes:
            samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
            negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
            hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
            val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
            tree.tree_.value[leaf, 0, 0] = val
          
    def predict_proba(self, X):
        '''Generate probability predictions for the given input data.'''
        raw_predictions =  np.zeros(shape=(X.shape[0], self.n_classes))
        for k in range(self.n_classes):
            for booster in self.boosters:
                raw_predictions[:, k] +=self.learning_rate * booster[k].predict(X)
        probabilities = self._softmax(raw_predictions)
        return probabilities
        
    def predict(self, X):
        '''Generate predicted labels (as 1-d array)'''
        probabilities = self.predict_proba(X)
        return np.argmax(probabilities, axis=1)

Testing our implementation

Let’s test our implementation alongside the scikit-learn GradientBoostingClassifier to ensure it works as expected.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=10000, 
                           n_classes=5, 
                           n_features=20,
                           n_informative=10,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, 
                                 learning_rate=0.3, 
                                 max_depth=6)
gbc.fit(X_train, y_train)
accuracy_score(y_test, gbc.predict(X_test))

0.7756

gbcfs = GradientBoostingClassifierFromScratch(n_estimators=10, 
                                              learning_rate=0.3, 
                                              max_depth=6)
gbcfs.fit(X_train, y_train)
accuracy_score(y_test, gbcfs.predict(X_test))

0.7768

Beautiful. Our implementation is performing comparably to the sklearn gradient boosting classifier!

Wrapping Up

Well, there you have it, another epic scratch build for the books. I think the most interesting thing about the multi-class gradient boosting algorithm is that it generates multi-dimensional predictions based on a single objective function by training multiple decision trees in each boosting round. That’s a very interesting extension of the classic gradient boosting machine! If you have questions about the implementation, or if you found this post helpful, please leave a comment below to tell me about it.

References

To leave a comment for the author, please follow the link and comment on their blog: Random Realizations .

Want to share your content on python-bloggers? click here.

Python-bloggers

Data science news and tutorials - contributed by Python bloggers