ML + XAI -> Strong GLM in Python
This article was first published on Python – Michael's and Christian's Blog, and kindly contributed to python-bloggers.
In our latest post, we explained how to use ML + XAI to build strong generalized linear models with R. Let’s do the same with Python.
Insurance pricing data
We will again use a synthetic dataset with 1 million insurance policies, with reference:
Mayer, M., Meier, D. and Wüthrich, M.V. (2023). SHAP for Actuaries: Explain Any Model. http://dx.doi.org/10.2139/ssrn.4389797
Let’s start by loading and describing the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from sklearn.datasets import fetch_openml
from sklearn.inspection import PartialDependenceDisplay
from sklearn.metrics import mean_poisson_deviance
from sklearn.dummy import DummyRegressor
from lightgbm import LGBMRegressor

# We need preview version of glum that adds formulaic API
# !pip install git+https://github.com/Quantco/glum@glum-v3#egg=glum
from glum import GeneralizedLinearRegressor

# Load data
df = fetch_openml(data_id=45106, parser="pandas").frame
df.head()

# Continuous features
df.hist(["driver_age", "car_weight", "car_power", "car_age"])
_ = plt.suptitle("Histograms of continuous features", fontsize=15)

# Response and discrete features
fig, axes = plt.subplots(figsize=(8, 3), ncols=3)
for v, ax in zip(["claim_nb", "year", "town"], axes):
    df[v].value_counts(sort=False).sort_index().plot(kind="bar", ax=ax, rot=0, title=v)
plt.suptitle("Barplots of response and discrete features", fontsize=15)
plt.tight_layout()
plt.show()

# Rank correlations
corr = df.corr("spearman")
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.suptitle("Rank-correlogram", fontsize=15)
_ = sns.heatmap(
    corr, mask=mask, vmin=-0.7, vmax=0.7, center=0, cmap="vlag", square=True
)
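As a quick sanity check (not part of the original post), we can glance at the data dimensions and the empirical claim frequency:

# Quick sanity check: data dimensions and empirical claim frequency
print(df.shape)               # roughly 1 million policies
print(df["claim_nb"].mean())  # average claim count per policy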
Modeling
- We fit a tuned boosted trees model of log(E(claim count)) via Poisson deviance loss.
- And perform a SHAP analysis to derive insights.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("claim_nb", axis=1), df["claim_nb"], test_size=0.1, random_state=30
)

# Tuning step not shown. Number of boosting rounds found via early stopping on CV performance
params = dict(
    learning_rate=0.05,
    objective="poisson",
    num_leaves=7,
    min_child_samples=50,
    min_child_weight=0.001,
    colsample_bynode=0.8,
    subsample=0.8,
    reg_alpha=3,
    reg_lambda=5,
    verbose=-1,
)

model_lgb = LGBMRegressor(n_estimators=360, **params)
model_lgb.fit(X_train, y_train)

# SHAP analysis
X_explain = X_train.sample(n=2000, random_state=937)
explainer = shap.Explainer(model_lgb)
shap_val = explainer(X_explain)

plt.suptitle("SHAP importance", fontsize=15)
shap.plots.bar(shap_val)

for s in [shap_val[:, 0:3], shap_val[:, 3:]]:
    shap.plots.scatter(s, color=shap_val, ymin=-0.5, ymax=1)
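The comment above notes that the tuning step is not shown. As an aside, here is a minimal sketch of how the number of boosting rounds could be found via cross-validated early stopping; this is our assumption, not the authors' actual tuning code:

import lightgbm as lgb

# Hypothetical sketch of the omitted tuning step: pick the number of
# boosting rounds by 5-fold CV with early stopping on the Poisson metric.
dtrain = lgb.Dataset(X_train, label=y_train)
cv_result = lgb.cv(
    params,                 # tuned parameters from above (objective="poisson")
    dtrain,
    num_boost_round=1000,   # generous upper bound
    nfold=5,
    stratified=False,       # regression-style folds
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
    seed=1,
)
# The length of the metric history equals the selected number of rounds
print(len(next(iter(cv_result.values()))))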
From the SHAP analysis, we would come to the following conclusions:
- car_weight and year might be dropped, depending on the specific aim of the model.
- Add a regression spline for driver_age.
- Add an interaction between car_power and town.
Build a strong GLM
Let’s build a GLM with these insights. Two important things:
- Glum is an extremely powerful GLM implementation that was inspired by a pull request of our own Christian Lorentzen.
- The upcoming version 3.0 adds a formula API based on formulaic, a very performant formula parser. This makes it very easy to add interaction effects, regression splines, dummy encodings, etc. (see the short formulaic sketch after this list).
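To illustrate what the formula parser does, here is a small sketch using formulaic directly. The model_matrix function is formulaic's public API; treat the exact spelling of the resulting column names as an assumption:

from formulaic import model_matrix

# Sketch: expand the GLM formula into a design matrix with formulaic.
# C(town) dummy-encodes town, bs(driver_age, 7) builds a B-spline basis,
# and "*" adds both main effects and their interaction.
mm = model_matrix("car_power * C(town) + bs(driver_age, 7) + car_age", df.head(1000))
print(mm)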
model_glm = GeneralizedLinearRegressor(
    family="poisson",
    l1_ratio=1.0,
    alpha=1e-10,
    formula="car_power * C(town) + bs(driver_age, 7) + car_age",
)
_ = model_glm.fit(X_train, y=y_train)

# PDPs of both models
fig, ax = plt.subplots(2, 2, figsize=(7, 5))
cols = ("tab:blue", "tab:orange")
for color, name, model in zip(cols, ("GLM", "LGB"), (model_glm, model_lgb)):
    disp = PartialDependenceDisplay.from_estimator(
        model,
        features=["driver_age", "car_age", "car_power", "town"],
        X=X_explain,
        ax=ax if name == "GLM" else disp.axes_,
        line_kw={"label": name, "color": color},
    )
fig.suptitle("PDPs of both models", fontsize=15)
fig.tight_layout()

# Stratified PDP of car_power
for color, town in zip(("tab:blue", "tab:orange"), (0, 1)):
    mask = X_explain.town == town
    disp = PartialDependenceDisplay.from_estimator(
        model_glm,
        features=["car_power"],
        X=X_explain[mask],
        ax=None if town == 0 else disp.axes_,
        line_kw={"label": town, "color": color},
    )
plt.suptitle("PDP of car_power stratified by town (0 vs 1)", fontsize=15)
_ = plt.ylim(0, 0.2)
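A nice byproduct of the GLM is that it is fully described by its coefficients on the log scale. A minimal sketch for inspecting them, assuming glum's formula interface populates feature_names_ (check your glum version):

# Inspect the fitted GLM on the link (log) scale.
# coef_ and intercept_ follow the scikit-learn convention;
# feature_names_ is assumed to be set by the formula interface.
coefs = pd.Series(model_glm.coef_, index=getattr(model_glm, "feature_names_", None))
print(np.exp(model_glm.intercept_))  # intercept back-transformed to the response scale
print(coefs)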
In this relatively simple situation, the mean Poisson deviances of our models are now very similar:
model_dummy = DummyRegressor().fit(X_train, y=y_train)
deviance_null = mean_poisson_deviance(y_test, model_dummy.predict(X_test))

dev_imp = []
for name, model in zip(("GLM", "LGB", "Null"), (model_glm, model_lgb, model_dummy)):
    dev_imp.append((name, mean_poisson_deviance(y_test, model.predict(X_test))))
pd.DataFrame(dev_imp, columns=["Model", "Mean_Poisson_Deviance"])
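Since the null deviance is already computed above, we can also express each model's fit as relative deviance improvement, i.e. a Poisson pseudo-R². This small sketch is not part of the original post:

# Relative deviance improvement over the null model (Poisson pseudo-R^2)
for name, model in zip(("GLM", "LGB"), (model_glm, model_lgb)):
    dev = mean_poisson_deviance(y_test, model.predict(X_test))
    print(f"{name}: {1 - dev / deviance_null:.4f}")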
Final words
- Glum is an extremely interesting GLM implementation. We have only scratched the surface; you can expect more blog posts on Glum…
- Having a formula interface is especially useful for adding interactions. Fingers crossed that the upcoming version 3.0 will soon be released.
- Building GLMs via ML + XAI is so smooth, especially when you work with large data. With small data, you need to be careful not to add hidden overfitting to the model.
Click here for the full Python notebook