Gradient Boosting Multi-Class Classification from Scratch
Tell me dear reader, who among us, while gazing in wonder at the improbably verdant aloe vera clinging to the windswept rock at Cape Point near the southern tip of Africa, hasn’t wondered: how the heck do gradient boosting trees implement multi-class classification? Today, we’ll unravel this mystery by reviewing the theory and implementing the algorithm for ourselves in python. Specifically, we’ll review the multi-class gradient boosting model originally described in Friedman’s classic Greedy Function Approximation paper, and we’ll implement components of the algorithm as we go along. Once we have all the pieces, we’ll write a python class for multi-class gradient boosting with a similar API to the scikit-learn GradientBoostingClassifier.
If you need a refresher on gradient boosting before diving in here, then start with my original gradient boosting from scratch post, which is the first installment in my ongoing series on gradient boosting.
The multi-class gradient boosting classification algorithm
Friedman describes the algorithm for training a multi-class classification gradient boosting model in Algorithm 6 of the classic Greedy Function Approximation paper. If you want a step-by-step walkthrough of the ideas in the paper, have a look at my post on the generalized gradient boosting algorithm. In high-level terms, the algorithm for multi-class gradient boosting is:
Set the initial model predictions.
Repeat the following for each boosting round.
Repeat the following for each class.
Compute the pseudo residuals.
Train a regression tree to predict the pseudo residuals.
Adjust the tree’s predicted values to optimize the objective function.
Add the new tree to the current composite model.
Let’s take a look at the details for each of these steps.
Target variable encoding
Following the convention in scikit-learn, when training a multi-class classifier, the target variable in the training dataset should be integer encoded so that the distinct classes are mapped to the integers $0, 1, \ldots, K-1$. In the code for model training, however, it’s going to be more convenient to work with a one hot encoded representation of the target. Therefore we’ll start by writing an internal method to transform the target variable from integer encoding to one hot encoding. Remember that eventually we’ll write a class for our multi-class gradient boosting model, so I’ll write this function like a class method with a leading argument called self.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def _one_hot_encode_labels(self, y):
    if isinstance(y, pd.Series):
        y = y.values
    ohe = OneHotEncoder()
    y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
    return y_ohe
This code takes the integer-encoded target variable, makes sure it’s a numpy array, then uses scikit-learn’s one hot encoder to encode it as a 2D array with observations along the first axis and classes along the second axis. I tend to think of the one hot encoded output as a matrix with $n$ rows (the number of observations in the training data) and $K$ columns (the number of classes), although it’s technically not a matrix but rather a 2D array.
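For example, here’s roughly what that encoding looks like on a toy target vector (a standalone snippet using the same OneHotEncoder logic, since the method above belongs to the class we’ll assemble later):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

y = np.array([0, 2, 1, 2, 0])  # toy integer-encoded target with 3 classes
y_ohe = OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()
print(y_ohe.shape)  # (5, 3): n observations by K classes
print(y_ohe[0])     # [1. 0. 0.] -- class 0 lights up the first column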
Model predictions in raw space and probability space
In a multi-class classification problem with $K$ classes, the model prediction for a particular observation returns a list of $K$ probabilities, one for each class. Essentially the model prediction is a conditional probability mass function for the discrete target variable, conditioned on the feature values.
So, we need a way to ensure that the model output is a valid probability mass function, i.e. each probability is in (0, 1) and the class probabilities sum to 1. Analogous to logistic regression, we can accomplish this by using the model to first make a raw prediction which can be any real number, then using something like the inverse logit function to transform the raw model prediction into a number between 0 and 1 that can be interpreted as a probability. Again analogous to logistic regression, in the multi-class setting we use $K$ different models, one for each class, to generate the raw predictions, then we transform the raw model predictions into probabilities using the softmax function, which takes a length-$K$ vector of real numbers as input and returns a probability mass function over the $K$ discrete classes.
Let $z = (z_1, z_2, \ldots, z_K)$ be the list of raw model outputs, and let $p = (p_1, p_2, \ldots, p_K)$ be the corresponding probability mass function over the $K$ classes; then the softmax function is defined as

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
Let's implement an internal softmax method that transforms the raw predictions into probabilities.
def _softmax(self, raw_predictions):
    numerator = np.exp(raw_predictions)
    denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
    return numerator / denominator
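One practical note: np.exp can overflow when raw predictions get very large. A common remedy (not something we need for the small examples in this post, but worth knowing) is to subtract the row-wise maximum before exponentiating, which leaves the softmax output unchanged. A variant along those lines might look like this:

def _softmax_stable(self, raw_predictions):
    # subtracting the row-wise max doesn't change the softmax output,
    # but keeps np.exp from overflowing for large raw predictions
    shifted = raw_predictions - np.max(raw_predictions, axis=1, keepdims=True)
    numerator = np.exp(shifted)
    return numerator / np.sum(numerator, axis=1, keepdims=True)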
Initial model predictions
We’re now ready to implement model training, starting with line 1 of the algorithm which sets the initial model predictions. In our code, we’ll keep the raw model predictions for the observations in the training dataset in a size $(n, K)$ array called raw_predictions, and we’ll keep the corresponding probabilities in another array of the same size called probabilities. Perhaps the simplest reasonable initialization is to set the probabilities to the uniform distribution, i.e. $p_{ik} = 1/K$, which implies raw predictions $z_{ik} = 0$.
We’ll go ahead and create that one hot encoded representation of the target, then use it to set the right size for the model prediction arrays.
y_ohe = self._one_hot_encode_labels(y)
raw_predictions = np.zeros(shape=y_ohe.shape)
probabilities = self._softmax(raw_predictions)
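As a quick sanity check, the softmax of an all-zero row really does give uniform probabilities of $1/K$; here’s a small standalone snippet using the same logic:

import numpy as np

raw = np.zeros(shape=(2, 5))  # 2 observations, 5 classes
numerator = np.exp(raw)
probs = numerator / np.sum(numerator, axis=1).reshape(-1, 1)
print(probs[0])  # [0.2 0.2 0.2 0.2 0.2]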
Boosting
Line 2 of the algorithm kicks off a loop to iteratively perform boosting rounds. Within each round, line 3 specifies that we iterate through each of the classes, adding a new booster model for each class at each boosting round. We’ll keep all the boosters in a list called boosters
, where each element is itself a list which we’ll call class_trees
that contains the trees we trained in a given boosting round. For each round and each class, we compute the pseudo residuals (negative gradients), train a decision tree to predict them, update the tree’s predicted values to optimize the overall objective function, then update the current raw and probability predictions before storing the new tree in that round’s list of class trees.
self.boosters = []
for m in range(self.n_estimators):
    class_trees = []
    for k in range(self.n_classes):
        negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
        hessians = self._hessians(probabilities[:, k])
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(X, negative_gradients)
        self._update_terminal_nodes(tree, X, negative_gradients, hessians)
        raw_predictions[:, k] += self.learning_rate * tree.predict(X)
        probabilities = self._softmax(raw_predictions)
        class_trees.append(tree)
    self.boosters.append(class_trees)
Next we’ll dive into the details of the pseudo residual computation and the adjustment to the tree booster predicted values.
Pseudo Residuals
For each observation in the training dataset, the pseudo residual is the negative gradient of the objective function with respect to the corresponding model prediction. The objective function for multi-class classification is the Multinomial Negative Log Likelihood. For a single observation $i$, the objective is

$$L_i = -\sum_{k=1}^{K} y_{ik} \log p_{ik},$$

where $y_{ik}$ is the one hot encoded target and $p_{ik}$ is the predicted probability for class $k$. We can rewrite the objective in terms of our raw model output like this.

$$L_i = -\sum_{k=1}^{K} y_{ik} z_{ik} + \log \sum_{j=1}^{K} e^{z_{ij}}$$

The negative gradient of the objective with respect to raw model prediction $z_{ik}$ for training example $i$ is given by

$$-\frac{\partial L_i}{\partial z_{ik}} = y_{ik} - p_{ik}.$$
You can take a look at the derivation if you’re curious how to work it out yourself. Note that this formula has a nice intuition. When $y_{ik} = 1$, if the predicted probability $p_{ik}$ is terrible and close to 0, then the pseudo residual will be positive and close to 1, and the next boosting round will try to increase the predicted probability. Otherwise, if the predicted probability is already good and close to 1, the pseudo residual will be close to 0 and the next boosting round won’t change the predicted probability very much.
We can easily implement an internal method to compute the negative gradients over the training dataset as follows.
def _negative_gradients(self, y_ohe, probabilities):
    return y_ohe - probabilities
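To see the intuition numerically, here’s a tiny standalone example (made-up numbers, not from our training data): a true class with a badly underestimated probability gets a large positive pseudo residual, while a well-predicted true class gets a residual near zero.

import numpy as np

y_ohe = np.array([1.0, 1.0, 0.0])           # first two entries are true classes
probabilities = np.array([0.1, 0.95, 0.2])  # predicted probabilities
print(y_ohe - probabilities)                # roughly [ 0.9  0.05 -0.2 ]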
Adjusting the trees’ predicted values
After training a regression tree to predict the pseudo residuals, we need to adjust the predicted values in its terminal nodes to optimize the overall objective function. In the Greedy Function Approximation paper, Friedman actually specifies finding the optimal value using a numerical optimization routine like line search. We could express that like

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$

where $R_{jm}$ is the set of samples falling into this terminal node.
In the scikit-learn implementation of gradient boosting classification, the authors instead use the approach from FHT00, which takes a single Newton descent step to approximate the optimal predicted value for each terminal node. See the code and comments for the function _update_terminal_regions
in the scikit-learn gradient boosting module. The updated value is computed like

$$\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} g_i}{\sum_{x_i \in R_{jm}} h_i},$$

the sum of the negative gradients in the node divided by the sum of the second derivatives (hessians).
We already found the first derivative of the objective, so we just need to calculate the second derivative:

$$h_{ik} = \frac{\partial^2 L_i}{\partial z_{ik}^2} = p_{ik}(1 - p_{ik}).$$

Here’s the internal method to compute the second derivative $p_{ik}(1 - p_{ik})$.
def _hessians(self, probabilities):
    return probabilities * (1 - probabilities)
Then we can implement the internal method for updating the tree predicted values. I give more details about how to manually set scikit-learn’s decision tree predicted values in the post on gradient boosting with any loss function.
def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
    '''Update the terminal node predicted values'''
    # terminal node id's
    leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
    # compute leaf for each sample in ``X``.
    leaf_node_for_each_sample = tree.apply(X)
    for leaf in leaf_nodes:
        samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
        negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
        hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
        val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
        tree.tree_.value[leaf, 0, 0] = val
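If you want to convince yourself that overwriting tree.tree_.value really does change the tree’s predictions, here’s a small standalone sketch (with made-up gradients and hessians, not from our model) that applies the same Newton-step update to a single leaf of a stump:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
g = rng.uniform(-1, 1, size=100)       # made-up negative gradients
p = rng.uniform(0.05, 0.95, size=100)  # made-up probabilities
h = p * (1 - p)                        # hessians

tree = DecisionTreeRegressor(max_depth=1).fit(X_demo, g)
leaf = np.nonzero(tree.tree_.children_left == -1)[0][0]  # first terminal node
in_leaf = tree.apply(X_demo) == leaf
tree.tree_.value[leaf, 0, 0] = g[in_leaf].sum() / h[in_leaf].sum()

# predictions for samples in that leaf now equal the Newton-step value
print(np.allclose(tree.predict(X_demo[in_leaf]), g[in_leaf].sum() / h[in_leaf].sum()))  # True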
Prediction
At inference time, the user supplies an X
with multiple observations of the feature variables, and our model needs to issue a prediction for each observation. We’ll start by implementing the predict_proba
method, which takes X
as input and returns a length-$K$ probability mass function for each observation in X
. To do this, we’ll initialize the raw predictions with zeros, just as we did in training, and then for each class, we’ll loop through all the boosters, collecting their predictions on X
, scaling by the learning rate, and summing them up. Finally, we use the softmax to transform raw predictions into the probabilities.
def predict_proba(self, X):
    '''Generate probability predictions for the given input data.'''
    raw_predictions = np.zeros(shape=(X.shape[0], self.n_classes))
    for k in range(self.n_classes):
        for booster in self.boosters:
            raw_predictions[:, k] += self.learning_rate * booster[k].predict(X)
    probabilities = self._softmax(raw_predictions)
    return probabilities
Then to get the predicted labels, we can use the predict_proba
method to generate probabilities, simply returning the integer-encoded class label of the largest probability for each observation in X
.
def predict(self, X):
    '''Generate predicted labels (as integer-encoded array)'''
    probabilities = self.predict_proba(X)
    return np.argmax(probabilities, axis=1)
The complete multi-class gradient boosting classification model implementation
Now we’re ready to implement a multi-class classification gradient boosting model class with public fit
, predict_proba
, and predict
methods. We combine the components above into a fit
method for model training, and we add the two prediction methods to complete the model’s functionality.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder


class GradientBoostingClassifierFromScratch():
    '''Gradient Boosting Classifier from Scratch.

    Parameters
    ----------
    n_estimators : int
        number of boosting rounds
    learning_rate : float
        learning rate hyperparameter
    max_depth : int
        maximum tree depth
    '''

    def __init__(self, n_estimators, learning_rate=0.1, max_depth=1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        '''Fit the GBM

        Parameters
        ----------
        X : ndarray of size (number observations, number features)
            design matrix
        y : ndarray of size (number observations,)
            integer-encoded target labels in {0,1,...,k-1}
        '''
        self.n_classes = pd.Series(y).nunique()
        y_ohe = self._one_hot_encode_labels(y)
        raw_predictions = np.zeros(shape=y_ohe.shape)
        probabilities = self._softmax(raw_predictions)

        self.boosters = []
        for m in range(self.n_estimators):
            class_trees = []
            for k in range(self.n_classes):
                negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
                hessians = self._hessians(probabilities[:, k])
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, negative_gradients)
                self._update_terminal_nodes(tree, X, negative_gradients, hessians)
                raw_predictions[:, k] += self.learning_rate * tree.predict(X)
                probabilities = self._softmax(raw_predictions)
                class_trees.append(tree)
            self.boosters.append(class_trees)

    def _one_hot_encode_labels(self, y):
        if isinstance(y, pd.Series):
            y = y.values
        ohe = OneHotEncoder()
        y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
        return y_ohe

    def _negative_gradients(self, y_ohe, probabilities):
        return y_ohe - probabilities

    def _hessians(self, probabilities):
        return probabilities * (1 - probabilities)

    def _softmax(self, raw_predictions):
        numerator = np.exp(raw_predictions)
        denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
        return numerator / denominator

    def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
        '''Update the terminal node predicted values'''
        # terminal node id's
        leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
        # compute leaf for each sample in ``X``.
        leaf_node_for_each_sample = tree.apply(X)
        for leaf in leaf_nodes:
            samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
            negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
            hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
            val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
            tree.tree_.value[leaf, 0, 0] = val

    def predict_proba(self, X):
        '''Generate probability predictions for the given input data.'''
        raw_predictions = np.zeros(shape=(X.shape[0], self.n_classes))
        for k in range(self.n_classes):
            for booster in self.boosters:
                raw_predictions[:, k] += self.learning_rate * booster[k].predict(X)
        probabilities = self._softmax(raw_predictions)
        return probabilities

    def predict(self, X):
        '''Generate predicted labels (as 1-d array)'''
        probabilities = self.predict_proba(X)
        return np.argmax(probabilities, axis=1)
Testing our implementation
Let’s test our implementation alongside the scikit-learn GradientBoostingClassifier
to ensure it works as expected.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=10000, n_classes=5, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, learning_rate=0.3, max_depth=6)
gbc.fit(X_train, y_train)
accuracy_score(y_test, gbc.predict(X_test))
0.7756
gbcfs = GradientBoostingClassifierFromScratch(n_estimators=10, learning_rate=0.3, max_depth=6)
gbcfs.fit(X_train, y_train)
accuracy_score(y_test, gbcfs.predict(X_test))
0.7768
Beautiful. Our implementation is performing comparably to the sklearn gradient boosting classifier!
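As one more quick check on the from-scratch model, each row of its predict_proba output should be a valid probability mass function:

probs = gbcfs.predict_proba(X_test)
print(probs.shape)                          # (2500, 5)
print(np.allclose(probs.sum(axis=1), 1.0))  # True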
Wrapping Up
Well, there you have it, another epic scratch build for the books. I think the most interesting thing about the multi-class gradient boosting algorithm is that it generates multi-dimensional predictions based on a single objective function by training multiple decision trees in each boosting round. That’s a very interesting extension of the classic gradient boosting machine! If you have questions about the implementation, or if you found this post helpful, please leave a comment below to tell me about it.
References