A Tweedie Trilogy — Part III: From Wrights Generalized Bessel Function to Tweedie’s Compound Poisson Distribution

This article was first published on Python – Michael's and Christian's Blog, and kindly contributed to python-bloggers.

TLDR: The scipy 1.7.0 release introduced Wright’s generalized Bessel function to the Python ecosystem. It is an important ingredient for the density and log-likelihood of Tweedie probability distributions. In this last part of the trilogy, I’d like to point out why it was important to have this function and share the endeavor of implementing this inconspicuous but highly intractable special function. The fun part is exploiting a free parameter in an integral representation, which can be optimized by curve fitting to the minimal arc length.

This trilogy celebrates the 40th birthday of Tweedie distributions in 2024 and highlights some of their very special properties.

See part I and part II.

Tweedie Distributions

As pointed out in part I and part II, the family of Tweedie distributions is a very special one with outstanding properties. They are central for estimating expectations with GLMs. The probability distributions have non-negative support and are skewed; examples are the Poisson, Gamma, Inverse Gaussian and compound Poisson-Gamma distributions.

As members of the exponential dispersion family, a slight extension of the exponential family, the probability density can be written as

\begin{align*}
f(y; \theta, \phi) &= c(y, \phi) \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right)
\\
\kappa(\theta) &= \kappa_p(\theta) = \frac{1}{2-p}((1-p)\theta)^{\frac{2-p}{1-p}}
\end{align*}

It is often more instructive to parametrise the distribution with p, \mu and \phi, using

\begin{align*}
\theta &= \begin{cases}
\frac{\mu^{1-p}}{1-p}\,,\quad p\neq 1\\
\log(\mu)\,,\quad p=1
\end{cases}
\\
\kappa(\theta) &= \begin{cases}
\frac{\mu^{2-p}}{2-p}\,,\quad p\neq 2\\
\log(\mu)\,,\quad p=2
\end{cases}
\end{align*}

and write

\begin{align*}
Y &\sim \mathrm{Tw}_p(\mu, \phi)
\end{align*}
Probability density of several Tweedie distributions.

Compound Poisson Gamma

A very special domain for the power parameter lies between the Poisson and Gamma cases: 1<p<2. This range results in the compound Poisson-Gamma distribution, which is suitable if you have a random count process where each count carries a random amount. A well-known example is insurance claims: typically, there is a random number of claims, and each claim has a random claim cost.

\begin{align*}
N &\sim \mathrm{Poisson}(\lambda)\\
X_i &\sim \mathrm{Gamma}(a, b)\\
Y &= \sum_{i=1}^N X_i \sim \mathrm{CompPois}(\lambda, a, b)
\end{align*}

For Poisson count we have \operatorname{E}[N]=\lambda and \operatorname{Var}[N]=\lambda=\operatorname{E}[N], for the Gamma amount \operatorname{E}[X]=\frac{a}{b} and \operatorname{Var}[X]=\frac{a}{b^2}=\frac{1}{a}\operatorname{E}[X]^2. For the compound Poisson-Gamma variable, we obtain

\begin{align*}
\operatorname{E}[Y] &= \operatorname{E}[N] \operatorname{E}[X] = \lambda\frac{a}{b}=\mu\\
\operatorname{Var}[Y] &=  \operatorname{Var}[N] \operatorname{E}[X]^2 +  \operatorname{E}[N] \operatorname{Var}[X] =  \phi \mu^p\\
p &= \frac{a + 2}{a+1} \in (1, 2)\\
\phi &= \frac{(\lambda a)^{1-p}}{(p-1)b^{2-p}}
\end{align*}

What’s so special here is that there is a point mass at zero, i.e., P(Y=0)=\exp(-\frac{\mu^{2-p}}{\phi(2-p)}) > 0. Hence, it is a suitable distribution for non-negative quantities with some exact zeros.
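These moment and point-mass formulas can be checked by simulation. A minimal sketch, with hypothetical example values for \lambda, a and b, exploiting the fact that a sum of N i.i.d. Gamma(a, b) variables is Gamma(Na, b):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example parameters: claim frequency lam, Gamma shape a and rate b.
lam, a, b = 2.0, 3.0, 1.5
n_sim = 100_000

# Y = sum_{i=1}^N X_i with N ~ Poisson(lam) and X_i ~ Gamma(a, b);
# conditional on N > 0, Y is Gamma(N * a, b).
N = rng.poisson(lam, size=n_sim)
Y = np.zeros(n_sim)
pos = N > 0
Y[pos] = rng.gamma(shape=a * N[pos], scale=1.0 / b)

# Tweedie parameters implied by (lam, a, b), as derived above.
p = (a + 2) / (a + 1)
mu = lam * a / b
phi = (lam * a) ** (1 - p) / ((p - 1) * b ** (2 - p))

print(Y.mean(), mu)                    # E[Y]   = mu
print(Y.var(), phi * mu**p)            # Var[Y] = phi * mu^p
print((Y == 0).mean(), np.exp(-lam))   # P(Y=0) = exp(-mu^(2-p)/(phi(2-p))) = exp(-lam)
```

The three printed pairs should agree up to Monte Carlo noise.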

Probability density for compound Poisson Gamma, point masses at zero are marked as points.

Code

The rest of this post is about how to compute the density for this parameter range. The easy part is \exp\left(\frac{y\theta - \kappa(\theta)}{\phi}\right), which can be implemented directly. The real obstacle is the term c(y, \phi), which is given by

\begin{align*}
c(y, \phi) &= \frac{\Phi(-\alpha, 0, t)}{y}
\\
\alpha &= \frac{2 - p}{1 - p}
\\
t &= \frac{\left(\frac{(p - 1)\phi}{y}\right)^{\alpha}}{(2-p)\phi}
\end{align*}

This depends on Wright’s (generalized Bessel) function \Phi(a, b, z) as introduced in a 1933 paper by E. M. Wright.
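Putting the pieces together, here is a minimal sketch of the density of \mathrm{Tw}_p(\mu, \phi) for 1<p<2, using scipy.special.wright_bessel (available since scipy 1.7.0). The parameter values are hypothetical; as a sanity check, the continuous part plus the point mass at zero should sum to one:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import wright_bessel

def tweedie_pdf(y, p, mu, phi):
    """Density of Tw_p(mu, phi) for 1 < p < 2 on y > 0, using the formulas above."""
    theta = mu ** (1 - p) / (1 - p)
    kappa = mu ** (2 - p) / (2 - p)
    alpha = (2 - p) / (1 - p)            # note: alpha < 0 here, so -alpha > 0
    t = ((p - 1) * phi / y) ** alpha / ((2 - p) * phi)
    c = wright_bessel(-alpha, 0.0, t) / y
    return c * np.exp((y * theta - kappa) / phi)

p, mu, phi = 1.5, 1.0, 1.0
mass_at_zero = np.exp(-mu ** (2 - p) / (phi * (2 - p)))
cont, _ = quad(lambda y: tweedie_pdf(y, p, mu, phi), 0, np.inf)
print(mass_at_zero + cont)  # ≈ 1.0
```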

Wright’s Generalized Bessel Function

According to DLMF 10.46, the function is defined as

\begin{equation*}
\Phi(a, b, z) = \sum_{k=0}^{\infty} \frac{z^k}{k!\Gamma(ak+b)}, \quad a > -1, b \in \mathbb{R}, z \in \mathbb{C}
\end{equation*}

which converges for every z; \Phi is an entire function. We will focus on the positive real axis z=x\geq 0 and the parameter range a\geq 0, b\geq 0 (note that a=-\alpha \in (0,\infty) for 1<p<2). For the compound Poisson-Gamma case, we even have b=0.

Implementing such a function, as done in scipy.special.wright_bessel, poses tremendous challenges even for the restricted parameter range. The first one is that it has three parameters, which is quite a lot. Then, the series representation above can in principle always be used, but depending on the parameters it requires a huge number of terms, particularly for large x. As each term involves the Gamma function, this becomes expensive very fast. One ends up using different representations and strategies for different parameter regions:

  • Small x: Taylor series according to definition
  • Small a: Taylor series in a=0
  • Large x: Asymptotic series due to Wright (1935)
  • Large a: Taylor series according to definition for a few terms around the approximate maximum term k_{max} due to Dunn & Smyth (2005)
  • General: Integral representation due to Luchko (2008)
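To see why the naive series struggles, here is a sketch comparing a direct, truncated Taylor evaluation (not the scipy algorithm) against scipy.special.wright_bessel for a moderately large x:

```python
import numpy as np
from scipy.special import gamma, wright_bessel

def wright_bessel_series(a, b, x, n_terms):
    """Truncated Taylor series from the definition; 1/Gamma(0) evaluates to 0."""
    k = np.arange(n_terms)
    return np.sum(x**k / (gamma(k + 1.0) * gamma(a * k + b)))

a, b, x = 0.5, 0.0, 100.0   # b = 0 as in the compound Poisson-Gamma case
exact = wright_bessel(a, b, x)
for n_terms in (10, 50, 100):
    approx = wright_bessel_series(a, b, x, n_terms)
    print(n_terms, approx / exact)   # 10 terms are far off; ~100 terms suffice here
```

For even larger x, the powers x^k overflow long before the series converges, which is exactly where the asymptotic and integral representations take over.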

Dunn & Smyth investigated several evaluation strategies for the simpler Tweedie density, which amounts to Wright’s function with b=0, see Dunn & Smyth (2005). Luchko (2008) lists most of the above strategies for the full Wright function.

The Integral Representation

This brings us deep into complex analysis: We start with Hankel’s contour integral representation of the reciprocal Gamma function.

\begin{equation*}
\frac{1}{\Gamma(z)} = \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-z} e^\zeta \; d\zeta
\end{equation*}

with the Hankel path Ha^- from negative infinity (A) just below the real axis, counter-clockwise with radius \epsilon>0 around the origin and just above the real axis back to minus infinity (D).

Hankel contour Ha in the complex plane.

In principle, one is free to choose any such path with the same start (A) and end point (D) as long as one does not cross the negative real axis. One usually lets AB and CD run infinitesimally close to the negative real line. Very importantly, the radius \epsilon>0 is a free parameter! That is real magic🪄

By interchanging sum and integral and using the series of the exponential, Wright’s function becomes

\begin{align*}
\Phi(a, b, z) &= \sum_{k=0}^{\infty} \frac{z^k}{k!} \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-(ak+b)} e^\zeta \; d\zeta
\\
&= \frac{1}{2\pi i} \int_{Ha^-} \zeta^{-b} e^{\zeta + z\zeta^{-a}} \; d\zeta
\end{align*}

Now, one needs to do the tedious work and split the integral into the three path sections AB, BC, CD. Putting AB and CD together gives an integral over K, and the circle segment BC gives an integral over P:

\begin{align*}
\Phi(a, b, x) &= \frac{1}{\pi} \int_{\epsilon}^\infty K(a, b, x, r) \; dr
\\
 &+ \frac{\epsilon^{1-b}}{\pi} \int_0^\pi P(\epsilon, a, b, x, \varphi) \; d\varphi
\\
K(a, b, x, r) &= r^{-b}\exp(-r + x  r^{-a} \cos(\pi a)) 
\\
&\quad \sin(x \cdot r^{-a} \sin(\pi a) + \pi b)
\\
P(\epsilon, a, b, x, \varphi) &= \exp(\epsilon \cos(\varphi) + x  \epsilon^{-a}\cos(a \varphi))
\\
&\quad \cos(\epsilon \sin(\varphi) - x \cdot \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi)
\end{align*}
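As a sanity check, one can carry out the two integrals directly with scipy.integrate.quad and compare against scipy.special.wright_bessel. The parameter point and the choice \epsilon=1 below are arbitrary:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import wright_bessel

def wright_bessel_integral(a, b, x, eps=1.0):
    """Luchko's integral representation; eps is the free Hankel radius."""
    def K(r):
        return r**(-b) * np.exp(-r + x * r**(-a) * np.cos(np.pi * a)) \
            * np.sin(x * r**(-a) * np.sin(np.pi * a) + np.pi * b)

    def P(phi):
        return np.exp(eps * np.cos(phi) + x * eps**(-a) * np.cos(a * phi)) \
            * np.cos(eps * np.sin(phi) - x * eps**(-a) * np.sin(a * phi) + (1 - b) * phi)

    int_K, _ = quad(K, eps, np.inf)
    int_P, _ = quad(P, 0, np.pi)
    return int_K / np.pi + eps**(1 - b) * int_P / np.pi

a, b, x = 0.5, 1.0, 3.0
print(wright_bessel_integral(a, b, x), wright_bessel(a, b, x))
```

Both values should agree up to quadrature tolerance.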

What remains is to carry out the numerical integration, also known as quadrature. While this is an interesting topic in its own right, let’s move on to the magic part.

Arc Length Minimization

If you have come so far and say, wow, puh, uff, crazy, 🤯😱 Just keep on a little bit because here comes the real fun part🤞

It turns out that most of the time, the integral over P is the more difficult one. The worst behaviour an integrand can have is being wildly oscillatory. Here is one of my favorite examples:

Integrands for a=5, b=1, x=100 and two choices of epsilon.

With the naive choice of \epsilon=1, both integrands (blue) are, well, crazy. There is basically no chance that even the most sophisticated quadrature rule will work. Now look at the other choice, \epsilon\approx 4: both curves seem well behaved (for P, we would need a closer look).

So the idea is to find a good choice of \epsilon that makes P well behaved. Well behaved here means as boring as possible, ideally a straight line. What makes a straight line unique? In flat space, it is the shortest path between two points. Therefore, well-behaved integrands have minimal arc length, and that is what we minimize.

The arc length S from x=a to x=b of a 1-dimensional function f is given by

\begin{equation*}
S = \int_a^b \sqrt{1 + f^\prime(x)^2} \; dx
\end{equation*}

Instead of f=P, we only take the oscillatory part of P, i.e. the argument of the cosine, and approximate the arc length using f(\varphi) = \epsilon \sin(\varphi) - x \epsilon^{-a} \sin(a \varphi) + (1-b) \varphi. For a single parameter point a, b, x this looks like

Arc length and integrand P for different epsilon, given a=0.1, b=5, x=100.

Note the logarithmic y-scale for the right plot of P. The optimal \epsilon=10 is plotted in red and behaves clearly better than smaller values of \epsilon.
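This can be sketched numerically: compute the arc length of the phase f with scipy.integrate.quad and minimize it over \epsilon with scipy.optimize.minimize_scalar. The search bounds below are an arbitrary choice:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def arc_length(eps, a, b, x):
    """Arc length over [0, pi] of f(phi) = eps sin(phi) - x eps^(-a) sin(a phi) + (1-b) phi."""
    def f_prime(phi):
        return eps * np.cos(phi) - a * x * eps**(-a) * np.cos(a * phi) + (1 - b)
    return quad(lambda phi: np.sqrt(1.0 + f_prime(phi)**2), 0.0, np.pi)[0]

# The parameter point from the plot above.
a, b, x = 0.1, 5.0, 100.0
res = minimize_scalar(arc_length, bounds=(0.5, 20.0), args=(a, b, x), method="bounded")
print(res.x, arc_length(res.x, a, b, x), arc_length(1.0, a, b, x))
```

The minimizing \epsilon should give a noticeably shorter arc than the naive \epsilon=1.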

What remains to be done for an actual implementation is

  • Calculate the arc-length-minimizing \epsilon on a large grid of values a, b, x.
  • Choose a parametrized function form for \epsilon(a, b, x).
  • Curve fitting (so again optimisation): fit this function to the minimal \epsilon values on the grid via least squares.
  • Implement some quadrature rules and use this fitted \epsilon, in the hope that it interpolates and extrapolates well.
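A toy version of these steps might look as follows. The grid, the power-law model \epsilon \approx c_0 x^{c_1}, and all bounds are hypothetical choices for illustration; the scipy implementation uses its own fitted formula:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import curve_fit, minimize_scalar

def min_arc_length_eps(a, b, x):
    """Step 1: the eps minimizing the arc length of the phase of P."""
    def arc_length(eps):
        f_prime = lambda phi: eps * np.cos(phi) - a * x * eps**(-a) * np.cos(a * phi) + (1 - b)
        return quad(lambda phi: np.sqrt(1.0 + f_prime(phi)**2), 0.0, np.pi)[0]
    return minimize_scalar(arc_length, bounds=(1e-2, 50.0), method="bounded").x

# A tiny grid: vary x for fixed a and b (b = 0 as in the compound Poisson-Gamma case).
a, b = 0.5, 0.0
xs = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
eps_opt = np.array([min_arc_length_eps(a, b, x) for x in xs])

# Steps 2-3: fit the hypothetical model eps(x) = c0 * x**c1 by least squares.
model = lambda x, c0, c1: c0 * x**c1
(c0, c1), _ = curve_fit(model, xs, eps_opt, p0=[1.0, 0.5])
print(eps_opt)
print(c0, c1)
```

The fitted formula can then be evaluated cheaply inside the quadrature routine instead of re-running the minimization for every function call.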

This strategy turns out to work well in practice and is implemented in scipy. As the 3-dimensional parameter space is huge, the integral representation breaks down in some regions, e.g. for huge values of \epsilon, where the integrands simply overflow numerically (in 64-bit floating point precision). But we have the other evaluation strategies for those cases.

Conclusion

An extensive notebook for Wright’s function, with all implementation strategies, can be found here.

After an adventurous journey, we arrived at one implementation strategy of Wright’s generalised Bessel function, namely the integral representation. The path went deep into complex analysis and contour integration, then further to the arc length of a function and finally curve fitting via optimisation. I am really astonished how connected all those different areas of mathematics can be.

Wright’s function is the missing piece to compute full likelihoods and probability functions of the Tweedie distribution family and is now available in the Python ecosystem via scipy.

We are at the very end of this Tweedie trilogy. I hope it has been entertaining and it has become clear why Tweedie deserves to be celebrated.

