Model Diagnostics: Statistics vs Machine Learning
In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.
As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications, 2nd ed. (2021). This dataset contains monthly rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999. The apartments have several features such as living area in square meters (area), year of construction (yearc), quality of location (location, 0: average, 1: good, 2: top), quality of the bathroom (bath, 0: standard, 1: premium), quality of the kitchen (kitchen, 0: standard, 1: premium), and an indicator for central heating (cheating).
The target variable is Y = \text{rent} and the goal of our model is to predict the mean rent, E[Y] (we omit the conditioning on X for brevity).
Disclaimer: Before presenting the use cases, let me clearly state that I am not in the apartment rent business and everything here is merely for the purpose of demonstrating statistical good practice.
Inference
The first use case is about inference of the effect of the features. Imagine the point of view of an investor who wants to know whether installing central heating is financially worth it. To lay the groundwork for a decision, a statistician must have answers to:
- What is the effect of the variable cheating on the rent?
- Is this effect statistically significant?
Prediction
The second use case is about prediction. This time, we take the point of view of someone looking for a new apartment to rent. In order to know whether the rent proposed by the landlord is about right or too high, a reference value would be very convenient. One can either ask the neighbors or ask a model to predict the rent of the apartment in question.
Model Fit
Before answering the above questions and doing some key diagnostics, we must load the data and fit a model. We choose a simple linear model and directly model rent.
Notes:
- For rent indices as well as house prices, one often log-transforms the target variable before modelling, or uses a log-link together with an appropriate loss function (e.g. Gamma deviance); see the sketch after these notes.
- Our Python version uses GeneralizedLinearRegressor from the package glum. We could as well have chosen other implementations like statsmodels.regression.linear_model.OLS. This way, we have to implement the residual diagnostics ourselves, which makes it clear what is plotted.
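As a side note to the first point, such a log-link Gamma model could be specified in glum roughly like this (a sketch only, not the model used in this post):

import glum

# Hypothetical alternative: Gamma GLM with log-link for the strictly
# positive target rent (same features as in the model below).
lm_gamma = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,
    family="gamma",
    link="log",
    formula="bs(area, degree=3, df=4) + yearc"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)",
)
lm_gamma.fit(X_train, y_train)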
For brevity, we skip imports and data loading. Our model is then fit by:
lm = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,  # this is very important if alpha=0
    formula="bs(area, degree=3, df=4) + yearc"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)"
)
lm.fit(X_train, y_train)
model = lm(
  formula = rent ~ bs(area, degree = 3, df = 4) + yearc
    + location + bath + kitchen + cheating,
  data = df_train
)
Diagnostics for Inference
The coefficient table will already tell us the effect of the cheating variable. For more involved models like gradient boosted trees or neural nets, one can use partial dependence and SHAP values to assess the effect of features.
lm.coef_table(X_train, y_train)
summary(model)
confint(model)
Variable | coef | se | p_value | ci_lower | ci_upper |
---|---|---|---|---|---|
intercept | -3682.5 | 327.0 | 0.0 | -4323 | -3041 |
bs(area, ..)[1] | 88.5 | 31.3 | 4.6e-03 | 27 | 150 |
bs(area, ..)[2] | 316.8 | 24.5 | 0.0 | 269 | 365 |
bs(area, ..)[3] | 547.7 | 62.8 | 0.0 | 425 | 671 |
bs(area, ..)[4] | 733.7 | 91.7 | 1.3e-15 | 554 | 913 |
yearc | 1.9 | 0.2 | 0.0 | 1.6 | 2.3 |
C(location)[2] | 48.2 | 5.9 | 4.4e-16 | 37 | 60 |
C(location)[3] | 137.9 | 27.7 | 6.6e-07 | 84 | 192 |
C(bath)[1] | 50.0 | 16.5 | 2.4e-03 | 18 | 82 |
C(kitchen)[1] | 98.2 | 18.5 | 1.1e-07 | 62 | 134 |
C(cheating)[1] | 107.8 | 10.6 | 0.0 | 87.0 | 128.6 |
We see that, ceteris paribus (all else equal), central heating increases the monthly rent by about 108 EUR. It is not the size of the effect (108 EUR) that is statistically significant, but the fact that central heating has an effect on the rent at all: this is indicated by the very low p-value for the null hypothesis that cheating has a coefficient of zero.
We also see the confidence interval at the default confidence level of 95%: [ci_lower, ci_upper] = [87, 129]. It shows the uncertainty of the estimated effect.
For a building with 10 apartments and an investment horizon of about 10 years, the estimated effect gives a budget of roughly 13,000 EUR per apartment (range roughly 10,500 to 15,500 with 95% confidence), i.e. about 130,000 EUR for the whole building.
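In code, this back-of-the-envelope calculation reads (a small sketch; the EUR values are the coefficient and confidence bounds from the table above):

# EUR per month times 12 months and 10 years gives the per-apartment budget;
# times 10 apartments the total for the building.
for eur_per_month in (87, 108, 129):  # ci_lower, point estimate, ci_upper
    per_apartment = eur_per_month * 12 * 10
    print(per_apartment, per_apartment * 10)
# per apartment: 10440, 12960, 15480; building: 104400, 129600, 154800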
A good statistician should ask several further questions:
- Is the dataset at hand a good representation of the population?
- Are there confounders or interaction effects, in particular between cheating and other features?
- Are the assumptions behind the low p-value and the confidence interval of cheating valid?
Here, we will only address the last question, and even that only partially. Which assumptions were made? The error term, \epsilon = Y - E[Y], should be homoscedastic and normally distributed. As the error is not observable (because the true model for E[Y] is unknown), one replaces E[Y] by the model prediction \hat{E}[Y]; this gives the residuals, \hat{\epsilon} = Y - \hat{E}[Y] (observed minus fitted values), instead. For homoscedasticity, the residuals should look like white (random) noise. Normality, on the other hand, becomes less of a concern with larger data thanks to the central limit theorem. With about 3000 data points, we are far from small data, but it might still be a good idea to check for normality.
The diagnostic tools to check this are residual plots and quantile-quantile (QQ) plots.
# See notebook for a definition of residual_plot.
import seaborn as sns

fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
ax = residual_plot(model=lm, X=X_train, y=y_train, ax=axes[0])
sns.kdeplot(
    x=lm.predict(X_train),
    y=residuals(lm, X_train, y_train, kind="studentized"),
    thresh=.02,
    fill=True,
    ax=axes[1],
).set(
    xlabel="fitted",
    ylabel="studentized residuals",
    title="Contour Plot of Residuals",
)
autoplot(model, which = c(1, 2))  # from library ggfortify

# density plot of residuals
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_density_2d() +
  geom_density_2d_filled(alpha = 0.5)
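The helpers residuals and residual_plot are defined in the accompanying notebook. As a rough, simplified substitute, a plain residual plot based on raw (not studentized) residuals could be sketched like this:

import matplotlib.pyplot as plt

def simple_residual_plot(model, X, y, ax=None):
    """Scatter raw residuals versus fitted values (a simplified stand-in for
    the notebook's residual_plot, which uses studentized residuals)."""
    ax = ax or plt.gca()
    fitted = model.predict(X)
    ax.scatter(fitted, y - fitted, s=10, alpha=0.3)
    ax.axhline(0, color="grey", linestyle="dashed")
    ax.set(xlabel="fitted", ylabel="residuals", title="Residuals vs Fitted")
    return ax

simple_residual_plot(lm, X_train, y_train)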

The more data points one has, the less informative a scatter plot becomes. Therefore, we add a contour plot on the right.
Visual insights:
- There seems to be larger variability for larger fitted values. This is a hint that homoscedasticity might be violated.
- The residuals seem to be centered around 0. This is a hint that the model is well calibrated (adequate).
# See notebook for a definition of qq_plot.
qq_plot(lm, X_train, y_train)
autoplot(model, which = 2)

The QQ plot shows the quantiles of the theoretically assumed distribution of the residuals on the x-axis and the ordered values of the residuals on the y-axis. In the Python version, we decided to use studentized residuals because normality of the error implies a Student's t distribution for these residuals.
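Likewise, qq_plot comes from the notebook. A simplified version, comparing raw residuals against a normal distribution via scipy, could look like this:

import matplotlib.pyplot as plt
from scipy import stats

# Simplified QQ plot: raw residuals against the normal distribution
# (the notebook's qq_plot uses studentized residuals and a t distribution).
raw_resid = y_train - lm.predict(X_train)
stats.probplot(raw_resid, dist="norm", plot=plt)
plt.title("QQ Plot of Raw Residuals")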
Concluding remarks:
- We might do similar plots on the test sample, but we don’t necessarily need a test sample to answer the inference questions.
- It is good practice to plot the residuals vs each of the features as well.
Diagnostics for Prediction
If we are only interested in predictions of the mean rent, \hat{E}[Y], we don't care much about the probability distribution of Y. We just want to know whether the predictions are close enough to the real mean of the rent, E[Y]. By a similar argument as for the error term and the residuals, we have to accept that E[Y] is not observable (it is exactly the quantity we want to predict). So we fall back on the observations of Y in order to judge whether our model is well calibrated, i.e., close to the ideal E[Y].
Very importantly, we use the test sample in all of the following diagnostics in order to guard against in-sample bias.
We start simply with a look at unconditional calibration, that is, the average (negative) residual \frac{1}{n}\sum(\hat{E}[Y_i]-Y_i).
compute_bias(
    y_obs=np.concatenate([y_train, y_test]),
    y_pred=lm.predict(pd.concat([X_train, X_test])),
    feature=np.array(["train"] * X_train.shape[0] + ["test"] * X_test.shape[0]),
)
print(paste("Train set mean residual:", mean(resid(model))))
print(paste("Test set mean residual: ", mean(df_test$rent - predict(model, df_test))))
set | mean bias | count | stderr | p-value |
---|---|---|---|---|
train | -3.2e-12 | 2465 | 2.8 | 1.0 |
test | 2.1 | 617 | 5.8 | 0.72 |
It is no surprise that bias_mean on the train set is almost zero: this is the balance property of (generalized) linear models (with intercept term). On the test set, however, we detect a small bias of about 2 EUR per apartment on average.
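The same point estimates (without standard errors and p-values) can be reproduced directly from the definition of the average bias; a quick sketch with plain numpy:

import numpy as np

# Unconditional calibration: average of predicted minus observed.
bias_train = np.mean(lm.predict(X_train) - y_train)  # ~0 by the balance property
bias_test = np.mean(lm.predict(X_test) - y_test)     # ~2 EUR, cf. table above
print(f"train: {bias_train:.2e}, test: {bias_test:.2f}")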
Next, we have a look at reliability diagrams, which contain much more information about calibration and bias of a model than the unconditional calibration above. In fact, they assess auto-calibration, i.e. how well the model uses its own information.
An ideal model would lie on the dotted diagonal line.
fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
plot_reliability_diagram(
    y_obs=y_train, y_pred=lm.predict(X_train), n_bootstrap=100, ax=axes[0]
)
axes[0].set_title(axes[0].get_title() + f" train set (n={X_train.shape[0]})")
plot_reliability_diagram(
    y_obs=y_test, y_pred=lm.predict(X_test), n_bootstrap=100, ax=axes[1]
)
axes[1].set_title(axes[1].get_title() + f" test set (n={X_test.shape[0]})")
iso_train = isoreg(x = model$fitted.values, y = df_train$rent)
iso_test = isoreg(x = predict(model, df_test), y = df_test$rent)
bind_rows(
  tibble(set = "train", x = iso_train$x[iso_train$ord], y = iso_train$yf),
  tibble(set = "test", x = iso_test$x[iso_test$ord], y = iso_test$yf),
) |>
  ggplot(aes(x = x, y = y, color = set)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  ggtitle("Reliability Diagram")
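For readers without the plotting helper used above, a bare-bones Python counterpart of the R approach, i.e. an isotonic regression of the observations on the predictions via scikit-learn, might look like this (a sketch without bootstrap uncertainty bands):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Reliability diagram: isotonic regression of observed on predicted values.
fig, ax = plt.subplots()
for label, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lm.predict(X)
    iso = IsotonicRegression(out_of_bounds="clip").fit(pred, y)
    grid = np.sort(pred)
    ax.plot(grid, iso.predict(grid), label=label)
ax.axline((0, 0), slope=1, linestyle="dashed", color="grey")
ax.set(xlabel="prediction", ylabel="isotonic fit of observations", title="Reliability Diagram")
ax.legend()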

Visual insights:
- Graphs on the train and test set look very similar. The larger uncertainty intervals on the test set stem from the fact that it has a smaller sample size.
- The model seems to lie around the diagonal, indicating good auto-calibration over most of the range.
- Very high predicted values seem to be systematically too low, i.e. the graph is above the diagonal.
Finally, we assess conditional calibration, i.e. the calibration with respect to the features. To this end, we plot one of our favorite graphs for each feature. It consists of:
- the average observed value of Y for each (binned) value of the feature
- the average predicted value
- the partial dependence
- a histogram of the feature (grey, right y-axis)
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 5 * 4), sharey=True)
for i, col in enumerate(["area", "yearc", "bath", "kitchen", "cheating"]):
    plot_marginal(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        X=X_train,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][0],
    )
    plot_marginal(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        X=X_test,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
    if i != 0:
        axes[i][0].get_legend().remove()
        axes[i][1].get_legend().remove()
fig.tight_layout()
xvars = c("area", "yearc", "bath", "kitchen", "cheating")
m_train = feature_effects(model, v = xvars, data = df_train, y = df_train$rent)
m_test = feature_effects(model, v = xvars, data = df_test, y = df_test$rent)
c(m_train, m_test) |>
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean", "pd"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test",
  )
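As a side note, the partial dependence curve shown in these plots can be computed by hand: fix the feature at a grid value for all rows and average the predictions. A small sketch (the helper partial_dependence is defined here only for illustration):

import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average prediction with `feature` set to each grid value for all rows."""
    X_mod = X.copy()
    averages = []
    for value in grid:
        X_mod[feature] = value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid_area = np.linspace(X_train["area"].min(), X_train["area"].max(), num=20)
pd_area = partial_dependence(lm, X_train, "area", grid_area)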

Visual insights:
- On the train set, the categorical features seem to have perfect calibration, as the average observed equals the average predicted value. This is again a consequence of the balance property. On the test set, we see a deviation, especially for categorical levels with smaller sample sizes. This is a good demonstration of why plotting on both train and test set is a good idea.
- The numerical features area and year of construction seem fine, but a closer look can’t hurt.
Next, we draw a bias plot, which shows the average difference of predicted minus observed per (binned) feature value. The values should be around zero, so we can zoom in on the y-axis. This is very similar to the residual plot, but the information is better condensed for its purpose.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 2 * 4), sharey=True)
axes[0, 0].set_ylim(-150, 150)
for i, col in enumerate(["area", "yearc"]):
    plot_bias(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        feature=X_train[col],
        ax=axes[i][0],
    )
    plot_bias(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        feature=X_test[col],
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
fig.tight_layout()
c(m_train[c("area", "yearc")], m_test[c("area", "yearc")]) |>
  plot(
    ylim = c(-150, 150),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )
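The quantity shown in these bias plots can also be computed directly, e.g. with a pandas groupby over binned feature values; a small sketch for area on the test set:

import pandas as pd

# Mean bias (predicted minus observed) per binned area on the test set.
df_bias = pd.DataFrame({
    "area_bin": pd.cut(X_test["area"], bins=10),
    "bias": lm.predict(X_test) - y_test,
})
print(df_bias.groupby("area_bin", observed=True)["bias"].agg(["mean", "count"]))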

Visual insights:
- For large values of area and for yearc in the 1940s and 1950s, there are only few observations available. Still, the model might be improved for those regions.
- The bias of yearc shows a parabolic curve. The simple linear effect in our model seems too simplistic. A refined model could use splines instead, as for area; see the sketch after this list.
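Such a refinement could look as follows (a sketch only; the spline degrees of freedom are chosen arbitrarily and not tuned):

import glum

# Hypothetical refinement: a spline also for the year of construction.
lm_refined = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,
    formula="bs(area, degree=3, df=4) + bs(yearc, degree=3, df=4)"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)",
)
lm_refined.fit(X_train, y_train)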
Concluding remarks:
- The predictions for area larger than around 120 square meters and for year of construction around the Second World War are less reliable.
- For everything else, the bias is smaller than 50 EUR on average. This is therefore a rough estimate of the prediction uncertainty. It should be enough to prevent improperly high (or low) rents (on average).
The full Python and R code is available under:
- Python: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2025-05-01%20diagnostics.ipynb
- R: https://github.com/lorentzenchr/notebooks/blob/master/blogposts/2025-05-01%20diagnostics.R