Tell me dear reader, who among us, while gazing in wonder at the improbably verdant aloe vera clinging to the windswept rock at Cape Point near the southern tip of Africa, hasn’t wondered: how the heck do gradient boosting trees implement multi-class classification? Today, we’ll unravel this mystery by reviewing the theory and implementing the algorithm for ourselves in python. Specifically, we’ll review the multi-class gradient boosting model originally described in Friedman’s classic Greedy Function Approximation paper, and we’ll implement components of the algorithm as we go along. Once we have all the pieces, we’ll write a python class for multi-class gradient boosting with a similar API to the scikit-learn GradientBoostingClassifier.
If you need a refresher on gradient boosting before diving in here, then start with my original gradient boosting from scratch post, which is the first installment in my ongoing series on gradient boosting.
The multi-class gradient boosting classification algorithm
Friedman describes the algorithm for training a multi-class classification gradient boosting model in Algorithm 6 of the classic Greedy Function Approximation paper. If you want a step-by-step walkthrough of the ideas in the paper, have a look at my post on the generalized gradient boosting algorithm. In high-level terms, the algorithm for multi-class gradient boosting is:
1. Set the initial model predictions.
2. Repeat the following for each boosting round.
3. Repeat the following for each class.
   - Compute the pseudo residuals.
   - Train a regression tree to predict the pseudo residuals.
   - Adjust the tree’s predicted values to optimize the objective function.
   - Add the new tree to the current composite model.
Let’s take a look at the details for each of these steps.
Target variable encoding
Following the convention in scikit-learn, when training a multi-class classifier, the target variable in the training dataset should be integer encoded so that the K class labels take the values 0, 1, …, K-1. Internally, our model will work with a one hot encoded representation of the target, so let’s write an internal helper method, _one_hot_encode_labels, to compute it.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def _one_hot_encode_labels(self, y):
    if isinstance(y, pd.Series):
        y = y.values
    ohe = OneHotEncoder()
    y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
    return y_ohe
This code takes the integer-encoded target variable, makes sure it’s a numpy array, then uses scikit-learn’s one hot encoder to encode it as a 2D array with observations along the first axis and classes along the second axis. I tend to think of the one hot encoded output as a matrix with one row per observation and one column per class.
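Just to make the encoding concrete, here’s a minimal standalone sketch of the same transformation (the toy label array is made up for illustration) for a 3-class problem.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical integer-encoded labels for a 3-class problem
y = np.array([0, 2, 1, 2])

ohe = OneHotEncoder()
y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
print(y_ohe)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]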
Model predictions in raw space and probability space
In a multi-class classification problem with K classes, our model needs to predict a vector of K class probabilities for each observation, but our tree boosters output raw, unbounded real values, one per class. So, we need a way to ensure that the model output is a valid probability mass function, i.e. each probability is in (0, 1) and the K probabilities sum to 1 for each observation. Let $f_k$ denote the raw model prediction for class $k$. The softmax transformation maps the raw predictions into probabilities:

$$p_k = \frac{e^{f_k}}{\sum_{j=1}^{K} e^{f_j}}$$
Let's implement an internal softmax method that transforms the raw predictions into probabilities.
def _softmax(self, raw_predictions):
    numerator = np.exp(raw_predictions)
    denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
    return numerator / denominator
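As a quick sanity check, here’s a standalone sketch applying the same computation to a small array of made-up raw scores and confirming that each row sums to one.

import numpy as np

# two observations, three classes of made-up raw scores
raw = np.array([[0.0, 1.0, -1.0],
                [2.0, 0.5, 0.5]])
numerator = np.exp(raw)
denominator = np.sum(np.exp(raw), axis=1).reshape(-1, 1)
probabilities = numerator / denominator
print(probabilities.sum(axis=1))  # [1. 1.]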
Initial model predictions
We’re now ready to implement model training, starting with line 1 of the algorithm which sets the initial model predictions. In our code, we’ll keep the raw model predictions in an array called raw_predictions, and we’ll keep the corresponding probabilities in an array called probabilities. Perhaps the simplest reasonable initialization is to set the probabilities to the uniform distribution, i.e. 1/K for each class, which we get by initializing the raw predictions to zero.
We’ll go ahead and create that one hot encoded representation of the target, then use it to set the right size for the model prediction arrays.
y_ohe = self._one_hot_encode_labels(y)
raw_predictions = np.zeros(shape=y_ohe.shape)
probabilities = self._softmax(raw_predictions)
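As a tiny standalone check (using an arbitrary 2-observation, 4-class shape), the zero initialization really does correspond to uniform probabilities of 1/K after the softmax.

import numpy as np

raw_predictions = np.zeros(shape=(2, 4))  # 2 observations, 4 classes
probabilities = np.exp(raw_predictions) / np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
print(probabilities)
# [[0.25 0.25 0.25 0.25]
#  [0.25 0.25 0.25 0.25]]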
Boosting
Line 2 of the algorithm kicks off a loop to iteratively perform boosting rounds. Within each round, line 3 specifies that we iterate through each of the K classes, training one new tree per class. We’ll store the trained trees in a list called self.boosters, where each element is itself a list which we’ll call class_trees that contains the K trees trained in that boosting round, one for each class.
self.boosters = []
for m in range(self.n_estimators):
    class_trees = []
    for k in range(self.n_classes):
        negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
        hessians = self._hessians(probabilities[:, k])
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(X, negative_gradients)
        self._update_terminal_nodes(tree, X, negative_gradients, hessians)
        raw_predictions[:, k] += self.learning_rate * tree.predict(X)
        probabilities = self._softmax(raw_predictions)
        class_trees.append(tree)
    self.boosters.append(class_trees)
Next we’ll dive into the details of the pseudo residual computation and the adjustment to the tree booster predicted values.
Pseudo Residuals
For each observation in the training dataset, the pseudo residual is the negative gradient of the objective function with respect to the corresponding model prediction. The objective function for multi-class classification is the Multinomial Negative Log Likelihood. For a single observation, the objective is

$$L = -\sum_{k=1}^{K} y_k \log p_k$$

where $y_k$ is the one hot encoded target and $p_k$ is the predicted probability for class $k$. We can rewrite the objective in terms of our raw model output $f_k$ by substituting the softmax for $p_k$ (and using the fact that the $y_k$ sum to one):

$$L = -\sum_{k=1}^{K} y_k f_k + \log \sum_{j=1}^{K} e^{f_j}$$

The negative gradient of the objective with respect to raw model prediction $f_k$ is

$$-\frac{\partial L}{\partial f_k} = y_k - p_k$$

You can take a look at the derivation if you’re curious how to work it out yourself. Note that this formula has a nice intuition. When $y_k = 1$ and the predicted probability $p_k$ is small, the pseudo residual is large and positive, pushing the next tree to increase the raw prediction for that class; when $y_k = 0$ and $p_k$ is large, the residual is negative, pulling the raw prediction back down.
We can easily implement an internal method to compute the negative gradients over the training dataset as follows.
def _negative_gradients(self, y_ohe, probabilities):
    return y_ohe - probabilities
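To illustrate that intuition numerically, here’s a standalone sketch with made-up membership indicators and probabilities for a single class column.

import numpy as np

# hypothetical one hot indicators and current predicted probabilities for one class
y_ohe_k = np.array([1.0, 0.0, 1.0])
probabilities_k = np.array([0.10, 0.80, 0.95])

print(y_ohe_k - probabilities_k)
# [ 0.9  -0.8   0.05]
# confidently wrong predictions yield large-magnitude residuals,
# while confident correct predictions yield residuals near zero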
Adjusting the trees’ predicted values
After training a regression tree to predict the pseudo residuals, we need to adjust the predicted values in its terminal nodes to optimize the overall objective function. In the Greedy Function Approximation paper, Friedman actually specifies finding the optimal value using a numerical optimization routine like line search. We could express that like

$$\gamma_{jkm} = \arg\min_{\gamma} \sum_{x_i \in R_{jkm}} L\big(y_i, F_{k,m-1}(x_i) + \gamma\big)$$

where $R_{jkm}$ is terminal node $j$ of the class-$k$ tree in boosting round $m$, and $F_{k,m-1}$ is the current raw model prediction for class $k$.
In the scikit-learn implementation of gradient boosting classification, the authors instead use the approach from FHT00 which uses a single Newton descent step to approximate the optimal predicted value for each terminal node. See code and comments for the function _update_terminal_regions in the scikit-learn gradient boosting module. The updated value is computed like

$$\gamma_{jkm} = \frac{\sum_{x_i \in R_{jkm}} (y_{ik} - p_{ik})}{\sum_{x_i \in R_{jkm}} p_{ik}(1 - p_{ik})}$$

i.e. the sum of the pseudo residuals in the node divided by the sum of the second derivatives of the objective for the samples in the node.
We already found the first derivative of the objective, so we just need to calculate the second derivative:

$$\frac{\partial^2 L}{\partial f_k^2} = p_k (1 - p_k)$$

Here’s the internal method to compute the second derivative.
def _hessians(self, probabilities):
    return probabilities * (1 - probabilities)
Then we can implement the internal method for updating the tree predicted values. I give more details about how to manually set scikit-learn’s decision tree predicted values in the post on gradient boosting with any loss function.
def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
    '''Update the terminal node predicted values'''
    # terminal node id's
    leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
    # compute leaf for each sample in ``X``.
    leaf_node_for_each_sample = tree.apply(X)
    for leaf in leaf_nodes:
        samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
        negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
        hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
        val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
        tree.tree_.value[leaf, 0, 0] = val
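To see how this update matches the Newton step formula above, here’s a standalone toy computation for a single hypothetical leaf containing three samples (the residuals and hessians are made up for illustration).

import numpy as np

# made-up pseudo residuals and hessians for three samples that land in one leaf
negative_gradients_in_leaf = np.array([0.9, -0.2, 0.6])
hessians_in_leaf = np.array([0.09, 0.16, 0.24])

# sum of pseudo residuals divided by sum of hessians gives the leaf's new predicted value
val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
print(val)  # ~2.65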
Prediction
At inference time, the user supplies an X with multiple observations of the feature variables, and our model needs to issue a prediction for each observation. We’ll start by implementing the predict_proba method, which takes X as input and returns an array of predicted class probabilities with one row per observation in X and one column per class. To do this, we’ll initialize the raw predictions with zeros, just as we did in training, and then for each class, we’ll loop through all the boosters, collecting their predictions on X, scaling by the learning rate, and summing them up. Finally, we use the softmax to transform the raw predictions into probabilities.
def predict_proba(self, X):
    '''Generate probability predictions for the given input data.'''
    raw_predictions = np.zeros(shape=(X.shape[0], self.n_classes))
    for k in range(self.n_classes):
        for booster in self.boosters:
            raw_predictions[:, k] += self.learning_rate * booster[k].predict(X)
    probabilities = self._softmax(raw_predictions)
    return probabilities
Then to get the predicted labels, we can use the predict_proba method to generate probabilities, simply returning the integer-encoded class label of the largest probability for each observation in X.
def predict(self, X):
    '''Generate predicted labels (as integer-encoded array)'''
    probabilities = self.predict_proba(X)
    return np.argmax(probabilities, axis=1)
The complete multi-class gradient boosting classification model implementation
Now we’re ready to implement a multi-class classification gradient boosting model class with public fit, predict_proba, and predict methods. We combine the components above into a fit method for model training, and we add the two prediction methods to complete the model’s functionality.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder


class GradientBoostingClassifierFromScratch():
    '''Gradient Boosting Classifier from Scratch.

    Parameters
    ----------
    n_estimators : int
        number of boosting rounds
    learning_rate : float
        learning rate hyperparameter
    max_depth : int
        maximum tree depth
    '''

    def __init__(self, n_estimators, learning_rate=0.1, max_depth=1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        '''Fit the GBM

        Parameters
        ----------
        X : ndarray of size (number observations, number features)
            design matrix
        y : ndarray of size (number observations,)
            integer-encoded target labels in {0,1,...,k-1}
        '''
        self.n_classes = pd.Series(y).nunique()
        y_ohe = self._one_hot_encode_labels(y)

        raw_predictions = np.zeros(shape=y_ohe.shape)
        probabilities = self._softmax(raw_predictions)

        self.boosters = []
        for m in range(self.n_estimators):
            class_trees = []
            for k in range(self.n_classes):
                negative_gradients = self._negative_gradients(y_ohe[:, k], probabilities[:, k])
                hessians = self._hessians(probabilities[:, k])
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, negative_gradients)
                self._update_terminal_nodes(tree, X, negative_gradients, hessians)
                raw_predictions[:, k] += self.learning_rate * tree.predict(X)
                probabilities = self._softmax(raw_predictions)
                class_trees.append(tree)
            self.boosters.append(class_trees)

    def _one_hot_encode_labels(self, y):
        if isinstance(y, pd.Series):
            y = y.values
        ohe = OneHotEncoder()
        y_ohe = ohe.fit_transform(y.reshape(-1, 1)).toarray()
        return y_ohe

    def _negative_gradients(self, y_ohe, probabilities):
        return y_ohe - probabilities

    def _hessians(self, probabilities):
        return probabilities * (1 - probabilities)

    def _softmax(self, raw_predictions):
        numerator = np.exp(raw_predictions)
        denominator = np.sum(np.exp(raw_predictions), axis=1).reshape(-1, 1)
        return numerator / denominator

    def _update_terminal_nodes(self, tree, X, negative_gradients, hessians):
        '''Update the terminal node predicted values'''
        # terminal node id's
        leaf_nodes = np.nonzero(tree.tree_.children_left == -1)[0]
        # compute leaf for each sample in ``X``.
        leaf_node_for_each_sample = tree.apply(X)
        for leaf in leaf_nodes:
            samples_in_this_leaf = np.where(leaf_node_for_each_sample == leaf)[0]
            negative_gradients_in_leaf = negative_gradients.take(samples_in_this_leaf, axis=0)
            hessians_in_leaf = hessians.take(samples_in_this_leaf, axis=0)
            val = np.sum(negative_gradients_in_leaf) / np.sum(hessians_in_leaf)
            tree.tree_.value[leaf, 0, 0] = val

    def predict_proba(self, X):
        '''Generate probability predictions for the given input data.'''
        raw_predictions = np.zeros(shape=(X.shape[0], self.n_classes))
        for k in range(self.n_classes):
            for booster in self.boosters:
                raw_predictions[:, k] += self.learning_rate * booster[k].predict(X)
        probabilities = self._softmax(raw_predictions)
        return probabilities

    def predict(self, X):
        '''Generate predicted labels (as 1-d array)'''
        probabilities = self.predict_proba(X)
        return np.argmax(probabilities, axis=1)
Testing our implementation
Let’s test our implementation alongside the scikit-learn GradientBoostingClassifier
to ensure it works as expected.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=10000, n_classes=5, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, learning_rate=0.3, max_depth=6)
gbc.fit(X_train, y_train)
accuracy_score(y_test, gbc.predict(X_test))
0.7756
gbcfs = GradientBoostingClassifierFromScratch(n_estimators=10, learning_rate=0.3, max_depth=6)
gbcfs.fit(X_train, y_train)
accuracy_score(y_test, gbcfs.predict(X_test))
0.7768
Beautiful. Our implementation is performing comparably to the sklearn gradient boosting classifier!
Wrapping Up
Well, there you have it, another epic scratch build for the books. I think the most interesting thing about the multi-class gradient boosting algorithm is that it generates multi-dimensional predictions based on a single objective function by training multiple decision trees in each boosting round. That’s a very interesting extension of the classic gradient boosting machine! If you have questions about the implementation, or if you found this post helpful, please leave a comment below to tell me about it.
References
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive Logistic Regression: A Statistical View of Boosting (FHT00). The Annals of Statistics, 28(2), 337–407.