Combining Algorithms for Classification with Python
Want to share your content on python-bloggers? click here.
Many approaches in machine learning involve making many models that combine their strength and weaknesses to make more accuracy classification. Generally, when this is done it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used it is the same algorithm but with variances in sampling and the use of features.
In addition to this common form of ensemble learning there is also a way to combine different algorithms to make predictions. For one way of doing this is through a technique called stacking in which the predictions of several models are passed to a higher model that uses the individual model predictions to make a final prediction. In this post we will look at how to do this using Python.
Assumptions
This blog usually tries to explain as much as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.
- Already familiar with python
- Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
- Familiar with cross-validation and hyperparameter tuning
We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.
The steps we will take in this post are as follows
- Data preparation
- Individual model development
- Ensemble model development
- Hyperparameter tuning of ensemble model
- Ensemble model testing
Below is all of the libraries we will be using in this post
import pandas as pd from sklearn.model_selection import GridSearchCV from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import LabelEncoder from pydataset import data from sklearn.model_selection import train_test_split from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import KNeighborsClassifier from mlxtend.classifier import EnsembleVoteClassifier from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from sklearn.metrics import accuracy_score from sklearn.metrics import classification_report
Data Preparation
We need to perform the following steps for the data preparation
- Load the data
- Select the independent variables to be used in the analysis
- Scale the independent variables
- Convert the dependent variable from text to numbers
- Split the data in train and test sets
Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post but you can explore this on your own. The data was also scaled because many algorithms are sensitive to this so it is best practice to always scale the data. We will use the StandardScaler function for this. Lastly, the dpeendent variable currently consist of values of “yes” and “no” these need to be convert to numbers 1 and 0. We will use the LabelEncoder function for this. The code for all of this is below.
df=data('Mroz') X,y=df[['hoursw','child6','child618','educw','hearnw','hoursh','educh','wageh','educwm','educwf','experience']],df['city'] sc=StandardScaler() X_scale=sc.fit_transform(X) X=pd.DataFrame(X_scale, index=X.index, columns=X.columns) le=LabelEncoder() y=le.fit_transform(y) X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)
We can now proceed to individul model development
Individual Model Development
Below are the steps for this part of the analysis
- Instantiate an instance of each algorithm
- Check accuracy of each model
- Check roc curve of each model
We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code
logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0) treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0) knnclf=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski') LDAclf=LDA()
We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.
clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf'] for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels): scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy') print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label)) for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels): scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc') print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label)) accuracy: 0.69 (+/- 0.04) [Logistic Regression] accuracy: 0.72 (+/- 0.06) [Decision Tree] accuracy: 0.66 (+/- 0.06) [KNN] accuracy: 0.70 (+/- 0.05) [LDAclf] roc auc: 0.71 (+/- 0.08) [Logistic Regression] roc auc: 0.70 (+/- 0.07) [Decision Tree] roc auc: 0.62 (+/- 0.10) [KNN] roc auc: 0.70 (+/- 0.08) [LDAclf]
The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.
Ensemble Model Development
The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results
mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1]) for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels): scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy') print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label)) for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels): scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc') print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label)) accuracy: 0.69 (+/- 0.04) [LR] accuracy: 0.72 (+/- 0.06) [tree] accuracy: 0.66 (+/- 0.06) [knn] accuracy: 0.70 (+/- 0.05) [LDA] accuracy: 0.70 (+/- 0.04) [combine] roc auc: 0.71 (+/- 0.08) [LR] roc auc: 0.70 (+/- 0.07) [tree] roc auc: 0.62 (+/- 0.10) [knn] roc auc: 0.70 (+/- 0.08) [LDA] roc auc: 0.72 (+/- 0.09) [combine]
You can see that the combine model as similar performance to the individual models. This means in this situation that the ensemble learning did not make much of a difference. However, we have not tuned are hyperparameters yet. This will be done in the next step.
Hyperparameter Tuning of Ensemble Model
We are going to tune the decision tree, logistic regression, and KNN model. There are many different hyperparameters we can tune. For demonstration purposes we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters we will run the model and pull the best hyperparameters values based on the roc auc as the metric. Below is the code and the output.
params={'decisiontreeclassifier__max_depth':[2,3,5], 'logisticregression__C':[0.001,0.1,1,10], 'kneighborsclassifier__n_neighbors':[5,7,9,11]} grid=GridSearchCV(estimator=mv_clf,param_grid=params,cv=10,scoring='roc_auc') grid.fit(X_train,y_train) grid.best_params_ Out[34]: {'decisiontreeclassifier__max_depth': 3, 'kneighborsclassifier__n_neighbors': 9, 'logisticregression__C': 10} grid.best_score_ Out[35]: 0.7196051482279385
The best values are as follows
- Decision tree max depth set to 3
- KNN number of neighbors set to 9
- logistic regression C set to 10
These values give us a roc auc of 0.72 which is still poor . We can now use these values when we test our final model.
Ensemble Model Testing
The following steps are performed in the analysis
- Created new instances of the algorithms with the adjusted hyperparameters
- Run the ensemble model
- Predict with the test data
- Check the results
Below is the first step
logclf=LogisticRegression(penalty='l2',C=10, random_state=0) treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0) knnclf=KNeighborsClassifier(n_neighbors=9,p=2,metric='minkowski') LDAclf=LDA()
Below is step two
mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1]) mv_clf.fit(X_train,y_train)
Below are steps 3 and four
y_pred=mv_clf.predict(X_test) print(accuracy_score(y_test,y_pred)) print(pd.crosstab(y_test,y_pred)) print(classification_report(y_test,y_pred)) 0.6902654867256637 col_0 0 1 row_0 0 29 58 1 12 127 precision recall f1-score support 0 0.71 0.33 0.45 87 1 0.69 0.91 0.78 139 avg / total 0.69 0.69 0.66 226
The accuracy is about 69%. One thing that is noticeable low is the recall for people who do not live in the city. This probably one reason why the overall roc auc score is so low. The f1-score is also low for those who do not live in the city as well. The f1-score is just a combination of precision and recall. If we really want to improve performance we would probably start with improving the recall of the no’s.
Conclusion
This post provided an example of how you can combine different algorithms to make predictions in Python. This is a powerful technique t to use. Off course, it is offset by the complexity of the analysis which makes it hard to explain exactly what the results mean if you were asked tot do so.
Want to share your content on python-bloggers? click here.