This article was first published on Python – Michael's and Christian's Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)

TLDR: This second part of the trilogy will have a deeper look at offsets and sample weights of a GLM. Their non-equivalence stems from the mean-variance relationship. This time, we not only have a Poisson frequency but also a Gamma severity model.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

See part I.

From Mean-Variance Relation to Score Equations

In the part I, we have already introduced the mean-variance relation of a Tweedie random variable Y\sim Tw_p(\mu, \phi) with Tweedie power p, mean \mu and dispersion parameter \phi:

\begin{align*}
\operatorname{E}[Y] &= \mu
\\
\operatorname{Var}[Y] &= \phi \mu^p = \phi v(\mu)
\end{align*}

with variance function v(\mu).

This variance function directly impacts the estimation of GLMs. Assume the task is to estimate the expectation of a random variableY_i\sim Tw_p(\mu_i, \phi/w_i), given observations of the target y_i and of explanatories variables, aka features, x_i\in R^k. A GLM then assumes a link functiong(\mu_i) = \sum_{j=1}^k x_{ij}\beta_jwith coefficients \beta to be estimated via an optimization procedure, of which the first order condition, also called score equation, reads

\begin{equation*}
\sum_i w_i \frac{y_i - \mu_i}{v(\mu_i)g'(\mu_i)} x_{ij}= 0 \quad \forall j =1, \ldots, k
\end{equation*}

This shows that the higher the Tweedie power p, entering via v(\mu) only, the less weight is given to deviations of large values. In other words, higher Tweedie powers result in GLMs that are less and less sensitive to what happens at large (expected) values.

This is also reflected in the deviance loss function. They can be derived from the negative log-likelihood and are given by

\begin{equation*}
d_p(y, \mu) =
2 \cdot
\begin{cases}
\frac{\max(0, y^{2-p})}{(1-p)(2-p)}-\frac{y\mu^{1-p}}{1-p}+\frac{\mu^{2-p}}{2-p} & p \in \mathrm{R}\setminus (0,1] \cup \{2\} \\
y\log\frac{y}{\mu} - y + \mu & p=0 \\
\frac{y}{\mu} - \log\frac{y}{\mu} - 1 & p=2
\end{cases}
\end{equation*}

These are the only strictly consistent scoring functions for the expectation (up to one multiplicative and one additive constant) that are homogeneous functions (of degree 2-p), see, e.g., Fissler et al (2022). The Poisson deviance (p=1), for example, has a degree of homogeneity of 1 and the same unit as the target variable. The Gamma deviance (p=2), on the other side, is zero-homogeneous and is completely agnostic to the scale of its arguments. This is another way of stating the above: the higher the Tweedie power the less it cares about large values.

It is also connected to the fact that Tweedie distributions are the only distributions from the exponential dispersion family that are closed under scale transformations:

\begin{align*}
Y &\sim Tw_p(\mu, \phi) \\
cY &\sim Tw_p(c\mu, c^{2-p}\phi) \quad \forall c>0
\end{align*}

Offsets and Sample Weights

Poisson GLM

When estimating counts with a Poisson GLM, there is often an exposure measure like time under consideration or underlying number of things (insurance policies, trees in a forest, radioactive atoms). One then often finds two different, but equivalent formulations of a Poisson GLM with log-link.

• Sample weights: Model frequency y=\frac{N}{w} and fit with sample weights w to estimate \operatorname{E}[y] = \mu_y = \exp(x \beta).
• Offsets: Model counts N, but account for the exposure w via an offset as \operatorname{E}[N]=\mu_N = \exp(x \beta + \log(w)) = w \mu_y.

Note that each way models a different target, so we had to use subscripts to distinguish the mean parameters \mu.

In this special case of a Poisson GLM with (canonical) log-link, both models are equivalent and will result in the exact same parameters \beta. You can plug it into the score equation to convince yourself.

Tweedie GLM

Very importantly, this simple equivalence of GLM fomulations with offsets and with sample weights does only hold for the Poisson GLM with log-link. It does not hold for any other Tweedie parameter or even other distributions from the exponential dispersion family.

One can show that a Tweedie GLM with log-link and offset (additive in link space) \log(u) on target y with weights w is equivalent to the same Tweedie GLM but with target \frac{y}{u} and weights w u^{2-p}.

So one can construct an equivalence between unweighted with offsets and weighted without offsets by setting u = \sqrt[2-p]{w}. But note that this does not work for a Gamma GLM which as p=2.

Example

We continue with the same dataset and model as in part I and show this (non-) equivalence with the offsets.

from glum import GeneralizedLinearRegressor
import pandas as pd

# ... quite some code ... here we abbreviate.
# Model frequency with weights (but without offsets)
y_freq = df["ClaimNb"] / df["Exposure"]
w_freq = df["Exposure"]
X = df[x_vars]
glm_params = {
"alpha": 0,
"drop_first": True,
}
glm_freq = GeneralizedLinearRegressor(
family="poisson", **glm_params
).fit(X, y_freq, sample_weight=w_freq)

# Model counts N = w * freq with offsets (but without weights)
N = w_freq * y_freq
glm_offset_freq = GeneralizedLinearRegressor(
family="poisson", **glm_params
).fit(X, N, offset=np.log(w_freq))

print(
f"intercept freq{'':<8}= {glm_freq.intercept_}\n"
f"intercept freq offset = {glm_offset_freq.intercept_}"
)
# intercept freq        = -3.756437676421677
# intercept freq offset = -3.7564376764216725

np.max(np.abs(glm_freq.coef_ - glm_offset_freq.coef_)) < 1e-13
# True

As next, we model the severity Y = \frac{loss}{N} with claim counts N as weights. As is standard, we use a Gamma GLM with log-link (which is not canonical this time).

# Model severity with weights (but without offsets)
y_sev = (df["ClaimAmount"] / df["ClaimNb"])
w_sev = df["ClaimNb"].fillna(0)
X = df[x_vars]
# Filter out zero count (w_sev==0) rows
w_gt_0 = w_sev > 0
y_sev = y_sev[w_gt_0]
X_sev = X[w_gt_0]
w_sev = w_sev[w_gt_0]

glm_sev = GeneralizedLinearRegressor(
family="gamma", **glm_params
).fit(X_sev, y_sev, sample_weight=w_sev)

# Note that the target is claim amount = w * sev.
claim_amount = w_sev * y_sev
glm_offset_sev = GeneralizedLinearRegressor(
family="gamma", **glm_params
).fit(X_sev, claim_amount, offset=np.log(w_sev))

print(
f"intercept sev{'':<8}= {glm_sev.intercept_}\n"
f"intercept sev offset = {glm_offset_sev.intercept_}"
)
# intercept sev        = 7.287909799461992
# intercept sev offset = 7.236827150674156

np.max(np.abs(glm_sev.coef_ - glm_offset_sev.coef_))
# 0.2119162919285421

The deviations might seem small, but they are there and add up:

print(
"Total predicted claim amounts with weights "
f"{np.sum(w_sev * glm_sev.predict(X_sev)):_.2f}"
)
print(
"Total predicted claim amounts offset       "
f"{np.sum(glm_offset_sev.predict(X_sev, offset=np.log(w_sev))):_.2f}"
)
# Total predicted claim amounts with weights 49_309_687.30
# Total predicted claim amounts offset       48_769_342.47

Here, it becomes evident that the two models are quite different.

Outlook

The full notebook can be found here.

The final part III of the Tweedie trilogy will follow in one week and go into details of the probability density function.