Nowadays, explaining the decisions of Statistical/Machine Learning (ML) algorithms is becoming a must, and is also going mainstream. In healthcare, for example, ML explainers could help in understanding how black-box – but accurate – ML prognoses about patients are formed.
One way to obtain these explanations (here is another way that I introduced in a previous post, based on Kernel Ridge regression) is to use the teller. The teller computes the effects of explanatory variables by using finite differences. In this post in particular, the teller is used to explain the popular xgboost's predictions on the Boston dataset.
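As a rough illustration of the finite-difference idea (a minimal sketch only, not the teller's actual implementation), the effect of a covariate on a fitted model's predictions can be approximated by slightly perturbing that covariate and measuring the change in the predictions:

import numpy as np

def finite_difference_effect(predict, X, j, h=1e-4):
    # Approximate the average effect of covariate j on the predictions of
    # `predict` over the rows of X, using a forward finite difference.
    # Illustrative sketch only; the teller's internals may differ.
    X_plus = X.copy()
    X_plus[:, j] += h
    # change in prediction divided by change in covariate j, averaged over rows
    return np.mean((predict(X_plus) - predict(X)) / h)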
The Boston dataset contains the following columns:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
Our objective is to understand how xgboost's predictions of medv are influenced by the other explanatory variables.
Installing packages teller and xgboost
!pip install the-teller --upgrade
!pip install xgboost --upgrade
Applying the teller's Explainer to xgboost predictions
We start by importing the packages and dataset useful for the demo:
import teller as tr
import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn import datasets
from sklearn.model_selection import train_test_split
from time import time

# import data
boston = datasets.load_boston()
X = np.delete(boston.data, 11, 1)  # drop column at index 11 ('B')
y = boston.target
col_names = np.append(np.delete(boston.feature_names, 11), 'MEDV')
The dataset is split into a training set and a test set, then xgboost is fitted to the training set:
# training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1233)

# fitting xgboost to the training set
regr = xgb.XGBRegressor(max_depth=4, n_estimators=100).fit(X_train, y_train)
The teller's Explainer is now used to understand how xgboost's predictions of medv are influenced by the explanatory variables.
start = time()

# creating an Explainer for the fitted object `regr`
expr = tr.Explainer(obj=regr)

# confidence int. and tests on covariates' effects (Jackknife)
expr.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1],
         method="ci")

# summary of results
expr.summary()

# timing
print(f"\n Elapsed: {time()-start}")
The variables with the most impactful effects on medv are nox and rm, which is a plausible observation: an increasing number of rooms drives the price higher, whereas pollution, i.e. an increase in nitrogen oxides concentration (as long as this information is well known by people in the city), drives home prices down.
In order to obtain the 95% confidence intervals presented in the output, Jackknife resampling is employed. If the confidence interval
does not contain 0, then the average effect is significantly different from 0, and the hypothesis
that it’s equal to 0 is rejected with a 5% risk of being wrong.
The little stars on the right indicate how significant the Student t-test (not robust, but still carrying some useful information) of "average effect of covariate x = 0" versus its contrary is, with a 5% risk of being wrong.
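For intuition, here is a minimal sketch of how a leave-one-out jackknife confidence interval and a Student t-statistic could be obtained for an average effect (illustrative only, with assumed variable names; this is not the teller's internal code):

import numpy as np
from scipy import stats

def jackknife_mean_ci(effects, level=0.95):
    # `effects` is a 1D array of per-observation effect estimates
    effects = np.asarray(effects, dtype=float)
    n = len(effects)
    # leave-one-out means
    loo_means = np.array([np.delete(effects, i).mean() for i in range(n)])
    mean_effect = effects.mean()
    # jackknife estimate of the standard error of the mean
    se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
    t_quantile = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    lower, upper = mean_effect - t_quantile * se, mean_effect + t_quantile * se
    # Student t-statistic and p-value for H0: average effect = 0
    t_stat = mean_effect / se
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 1))
    return (lower, upper), t_stat, p_value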
Armed with this, we could say that in this context, and in this particular setting, removing RAD, ZN, CHAS and CRIM is suggested by xgboost and the teller.
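As a possible follow-up (a sketch under the assumption that the column names match the list built above), one could drop those four columns, refit xgboost, and check that the test performance does not degrade much:

# indices of CRIM, ZN, CHAS and RAD in col_names (assumed naming; verify col_names)
drop_idx = [np.where(col_names[:-1] == name)[0][0]
            for name in ("CRIM", "ZN", "CHAS", "RAD")]
X_train_red = np.delete(X_train, drop_idx, axis=1)
X_test_red = np.delete(X_test, drop_idx, axis=1)

# refit the same xgboost model on the reduced set of covariates
regr_red = xgb.XGBRegressor(max_depth=4, n_estimators=100).fit(X_train_red, y_train)
print(regr_red.score(X_test_red, y_test))  # R^2 on the test set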