AdaBoost Classification in Python

This article was first published on python – educational research techniques , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successful predict what prior models were unable to do. The average for regression and majority vote for classification are used. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.

Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Decision tree baseline model
  3. Hyperparameter tuning of Adaboost model
  4. AdaBoost model development

Below is some initial code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

Data preparation is minimal in this situation. We will load are data and at the same time drop any NA using the .dropna() function. In addition, we will place the independent variables in dataframe called X and the dependent variable in a dataset called y. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']

Decision Tree Baseline Model

We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.

We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the  AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

Hyperparameter Tuning AdaBoost Model

In order to tune the hyperparameters there are several things that we need to do. First we need to initiate  our AdaBoostClassifier with some basic settings. Then We need to create our search grid with the hyperparameters. There are two hyperparameters that we will set and they are number of estimators (n_estimators) and the learning rate.

Number of estimators has to do with how many trees are developed. The learning rate indicates how each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and for cross-validation. Below is the code for all of this

ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

We can now run the model of hyperparameter tuning and see the results. The code is below.

search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802

We can see that if the learning rate is set to 0.01 and the number of estimators to 1000 We can expect an accuracy of 74%. This is superior to our baseline model.

AdaBoost Model

We can now rune our AdaBoost Classifier based on the recommended hyperparameters. Below is the code.

score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589

We knew we would get around 74% and that is what we got. It’s only a 3% improvement but depending on the context that can be a substantial difference.

Conclusion

In this post, we look at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models to determine the most accurate classification in a sequential manner. Doing this will often lead to an improvement in the prediction of a model.

To leave a comment for the author, please follow the link and comment on their blog: python – educational research techniques .

Want to share your content on python-bloggers? click here.