This post will share how to use the AdaBoost algorithm for regression in Python. Boosting builds multiple models in a sequential manner, with each new model trying to successfully predict what the older models struggled with. For regression, the average of the models' predictions is used as the final prediction. Boosting is most commonly used with decision trees, but the approach can be applied to any supervised machine learning algorithm.
Boosting is associated with ensemble learning because several models are created and then combined. The assumption behind boosting is that combining several weak models can produce one strong and accurate model.
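To make the ensemble idea concrete, below is a minimal, illustrative sketch: several weak (depth-1) trees are fit and their predictions averaged. The toy data and the choice of five trees are assumptions for illustration only; AdaBoost adds sequential re-weighting on top of this so that each new tree focuses on the examples the previous ones predicted poorly.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine curve (illustrative only, not this post's dataset)
rng = np.random.RandomState(1)
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.2, 100)

# Fit a handful of weak trees on resampled subsets and average their predictions
weak_trees = []
for i in range(5):
    idx = rng.choice(len(X_toy), size=80, replace=True)
    weak_trees.append(DecisionTreeRegressor(max_depth=1).fit(X_toy[idx], y_toy[idx]))

averaged_prediction = np.mean([t.predict(X_toy) for t in weak_trees], axis=0)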
For our purposes, we will be using AdaBoost regression to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of the AdaBoost regression model
- AdaBoost regression model development
Below is some initial code.
from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
Data Preparation
There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
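Although the rest of this post relies on cross-validation, train_test_split was imported above, so a quick look at the data and an optional hold-out split might look like the sketch below. The test_size of 0.2 is an assumption and is not part of the original workflow.

# Optional: inspect the data and create a hold-out split (not used later)
print(df.shape)
print(X.head())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)  # test_size assumed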
We will now proceed to create the baseline regression decision tree model.
Baseline Regression Tree Model
The purpose of the baseline model is to give us something to compare the AdaBoost model against. To make this model, we need to initiate a k-fold cross-validation object, which helps stabilize the results. Next, we will create a for loop so that we can create several trees that vary in their depth. Depth refers to how many levels of splits the tree is allowed to make; more depth often leads to a higher likelihood of overfitting.
Finally, we will print the results for each tree. The criterion used for judgment is the mean squared error. Below is the code and the results.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)  # k-fold object used throughout; 10 folds assumed
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
    print(depth, score)

1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
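As a side note, the same loop can store its scores in a dictionary so the best depth is selected programmatically rather than read off the printout. This is only an illustrative variation of the loop above.

# Optional: collect scores per depth and pick the best one programmatically
depth_scores = {}
for depth in range(1, 10):
    tree_regressor = tree.DecisionTreeRegressor(max_depth=depth, random_state=1)
    depth_scores[depth] = np.mean(cross_val_score(
        tree_regressor, X, y, scoring='neg_mean_squared_error',
        cv=crossvalidation, n_jobs=1))
best_depth = max(depth_scores, key=depth_scores.get)  # highest (least negative) score wins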
Looks like a tree with a depth of 2 had the lowest amount of error (the scores are negative mean squared error, so the value closest to zero is best). We can now move to tuning the hyperparameters for the AdaBoost algorithm.
Hyperparameter Tuning
For hyperparameter tuning we need to start by initiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters: the number of estimators and the learning rate. The number of estimators tells Python how many models to make, and the learning rate indicates how much each tree contributes to the overall result. There is one more parameter, random_state, but this is just for setting the seed and never changes.
After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function you have to set the estimator, which is the AdaBoostRegressor; the parameter grid, which we just made; the cross-validation object, which we made when we created the baseline model; and n_jobs, which allocates resources for the calculation. Below is the code.
ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
Next, we can run the search with the desired grid in place. Below is the code for fitting the model as well as the best parameters and the score to expect when using those parameters.
search.fit(X,y)

search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}

search.best_score_
Out[32]: -164.93176650920856
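If you want more than just the best combination, the full set of grid search results can be inspected as a DataFrame. This optional step is not part of the original workflow.

# Optional: view the mean test score for every combination in the grid
results = pd.DataFrame(search.cv_results_)
print(results[['param_n_estimators', 'param_learning_rate', 'mean_test_score']])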
The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean squared error of about 165, which is a little lower than the 176 from our single decision tree. We will see how this works when we run our model with the refined hyperparameters.
AdaBoost Regression Model
Below is our model but this time with the refined hyperparameters.
ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))

score
Out[36]: -174.52604137201791
You can see the score is not quite as good as the grid search result, but it is within reason. Note that the learning rate used here (0.001) differs from the tuned value of 0.01, which likely accounts for much of the gap.
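As an optional next step not covered in the original workflow, the tuned model can be fit on the full dataset so it can generate predictions, and the feature importances can be inspected to see which variables the ensemble leaned on.

# Optional: fit on all the data, predict for a few patients, and check importances
ada2.fit(X, y)
print(ada2.predict(X.head()))  # predictions for the first few rows
print(dict(zip(X.columns, ada2.feature_importances_)))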
Conclusion
In this post, we explored how to use the AdaBoost algorithm for regression. Combining many weak learners in this way can often strengthen a model's predictive performance compared to a single decision tree.