The Bootstrap Chain Ladder from Scratch in Python


The bootstrap chain ladder is a statistical technique used in actuarial reserving to project future claim liabilities based on historical data. It builds upon the chain ladder method, which by itself provides no measure of the uncertainty in its projections; this is where the bootstrap comes in. The bootstrap chain ladder produces an empirical distribution of reserve estimates, which can be used to quantify the uncertainty and variability around the estimates generated by the chain ladder.

The bootstrap chain ladder works by repeatedly resampling residuals from the original chain ladder model and creating a series of simulated datasets, each of which mimics the original data in its pattern of claims development. The most commonly used approach was originally outlined by England and Verrall in Stochastic Claims Reserving in General Insurance, which notably described a technique that can be used to incorporate process variance, allowing for the quantification of the full prediction error of the total needed reserve.
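
England and Verrall's bootstrap rests on the over-dispersed Poisson (ODP) model of incremental losses. Stated briefly, with $C_{ij}$ the incremental loss for origin period $i$ and development period $j$, $m_{ij}$ its fitted value and $\phi$ the scale parameter (notation introduced more fully below), the model assumes

$$\mathrm{E}[C_{ij}] = m_{ij}, \qquad \mathrm{Var}[C_{ij}] = \phi\, m_{ij},$$

which motivates both the Pearson residual definitions and the gamma distribution used for process variance later in the article.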

The bootstrap chain ladder is simple from a technical standpoint, but it is worth highlighting the DataFrame operations required to implement the full algorithm. The code used here leverages the Pandas library, but a future post will reproduce the full estimator pipeline using Polars.

Much of what follows is a direct implementation of Appendix 3 from Two Approaches to Calculating Correlated Reserve Indications Across Multiple Lines of Business by Kirschner et al. Many of the DataFrame operations exhibited here could be implemented more efficiently, but the goal of this article is to present the bootstrap chain ladder with maximal clarity rather than optimal performance.

The steps to perform the bootstrap chain ladder:

  1. Transform loss data into cumulative triangle representation.

  2. Calculate all-year volume weighted age-to-age factors.

  3. Calculate the cumulative fitted triangle by applying backwards recursion, beginning with the observed cumulative losses from the latest diagonal.

  4. Calculate the unscaled Pearson residuals $r_{us}$, the degrees of freedom $DF$ and the scale parameter $\phi$.

  5. Calculate the adjusted Pearson residuals, defined as $r_{adj} = \sqrt{n / DF} \times r_{us}$.

  6. For each bootstrap sample (1…1000):

    1. Generate a sample from the adjusted Pearson residuals with replacement.

    2. Using the sampled adjusted Pearson residuals and the fitted incremental triangle, construct the triangle of sampled incremental losses $C^{*} = m + r^{*}\sqrt{m}$, where $r^{*}$ represents a sample with replacement from the adjusted Pearson residuals and $m$ the fitted incremental triangle.

    3. Create a cumulative triangle using the result from the previous step, and project future losses using the chain ladder method.

    4. Incorporate process variance by simulating each future projected incremental loss from a gamma distribution parameterized with mean equal to the projected loss value and variance equal to the projected loss value times $\phi$.

    5. Cumulate the incremental losses, then compute the total needed reserve as the ultimate projected value minus the latest cumulative loss amount by origin period.

  7. Compute quantiles of interest from the predictive distribution of bootstrap samples.

In what follows each step is demonstrated, along with exhibits to visually assess the distribution of future losses.

%load_ext watermark

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.set_printoptions(suppress=True, precision=5)
pd.set_option("display.precision", 5)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
warnings.simplefilter(action="ignore", category=FutureWarning)

%watermark --python --conda --hostname --machine --iversions
Python implementation: CPython
Python version       : 3.11.10
IPython version      : 8.28.0

conda environment: py311

Compiler    : MSC v.1941 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 170 Stepping 4, GenuineIntel
CPU cores   : 22
Architecture: 64bit

Hostname: JTRIZPC11

numpy     : 2.1.0
pandas    : 2.2.2
matplotlib: 3.9.2

Begin by loading the data from GitHub and transforming it into an incremental triangle:

# Load RAA dataset. 
dfraa = pd.read_csv("https://gist.githubusercontent.com/jtrive84/976c80786a6e97cce7483e306562f85b/raw/06a5c8b1f823fbe2b6da15f90a672517fa5b4571/RAA.csv")
dfraa = dfraa.sort_values(by=["ORIGIN", "DEV"])
tri0 = dfraa.pivot(index="ORIGIN", columns="DEV").rename_axis(None)
tri0.columns = tri0.columns.droplevel(0)
tri0.columns.name = None

print("Original incremental loss triangle:")

tri0
Original incremental loss triangle:
           1       2       3       4       5       6       7      8      9     10
1981  5012.0  3257.0  2638.0   898.0  1734.0  2642.0  1828.0  599.0   54.0  172.0
1982   106.0  4179.0  1111.0  5270.0  3116.0  1817.0  -103.0  673.0  535.0    NaN
1983  3410.0  5582.0  4881.0  2268.0  2594.0  3479.0   649.0  603.0    NaN    NaN
1984  5655.0  5900.0  4211.0  5500.0  2159.0  2658.0   984.0    NaN    NaN    NaN
1985  1092.0  8473.0  6271.0  6333.0  3786.0   225.0     NaN    NaN    NaN    NaN
1986  1513.0  4932.0  5257.0  1233.0  2917.0     NaN     NaN    NaN    NaN    NaN
1987   557.0  3463.0  6926.0  1368.0     NaN     NaN     NaN    NaN    NaN    NaN
1988  1351.0  5596.0  6165.0     NaN     NaN     NaN     NaN    NaN    NaN    NaN
1989  3133.0  2262.0     NaN     NaN     NaN     NaN     NaN    NaN    NaN    NaN
1990  2063.0     NaN     NaN     NaN     NaN     NaN     NaN    NaN    NaN    NaN

A number of functions are defined that will be used throughout the remainder of the article:

  • to_cum: Accepts an incremental triangle and returns a cumulative triangle.

  • to_incr: Accepts a cumulative triangle and returns an incremental triangle.

  • get_a2a_factors: Accepts a cumulative triangle and returns the all-year volume weighted age-to-age factors.

  • get_latest: Accepts a triangle and returns the value at the latest development period by origin.

  • square_tri: Accepts a cumulative triangle and set of age-to-age factors and projects future losses, populating the lower-right of the original cumulative triangle.

For get_a2a_factors, get_latest and square_tri, a few simplifying assumptions have been made:

  1. The triangles under consideration have an equal number of development and origin periods.
  2. Development periods are sequential starting with 1.
  3. No tail factor is included.
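
These assumptions are easy to check up front. The snippet below is a small, hypothetical sanity check (the name validate_tri and its checks are illustrative, not part of the original workflow) that can be run on a triangle before handing it to the functions defined next:

def validate_tri(tri: pd.DataFrame) -> None:
    """
    Illustrative checks for the simplifying assumptions above: the triangle
    is square, and development periods are sequential integers starting at 1.
    """
    nbr_origins, nbr_devps = tri.shape
    if nbr_origins != nbr_devps:
        raise ValueError("Triangle must have equal numbers of origin and development periods.")
    if list(tri.columns) != list(range(1, nbr_devps + 1)):
        raise ValueError("Development periods must be sequential integers starting at 1.")
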
def to_cum(tri: pd.DataFrame) -> pd.DataFrame:
    """
    Accepts a DataFrame structured as an incremental triangle and 
    returns a cumulative triangle.

    Parameters
    ----------
    tri : pd.DataFrame
        Incremental triangle.

    Returns
    -------
    pd.DataFrame
        Cumulative triangle
    """
    return tri.cumsum(axis=1)


def to_incr(tri: pd.DataFrame) -> pd.DataFrame:
    """
    Accepts a DataFrame structured as a cumulative triangle and 
    returns an incremental triangle.

    Parameters
    ----------
    tri : pd.DataFrame
        Cumulative triangle.

    Returns
    -------
    pd.DataFrame
        Incremental triangle.       
    """
    tri_incr = tri.diff(axis=1)
    tri_incr.iloc[:, 0] = tri.iloc[:, 0]
    return tri_incr


def get_a2a_factors(tri: pd.DataFrame) -> pd.Series:
    """
    Calculate all-year volume weighted age-to-age factors. 

    Parameters
    ----------
    tri : pd.DataFrame
        Cumulative triangle.

    Returns
    -------
    pd.Series
        All-year volume weighted age-to-age factors.
    """
    all_devps = tri.columns.tolist()
    min_origin, max_origin = tri.index.min(), tri.index.max()
    dps0, dps1 = all_devps[:-1], all_devps[1:]  
    a2a_headers = [f"{ii}-{jj}" for ii, jj in zip(dps0, dps1)]
    a2a = []

    for dp1, dp0 in zip(dps1[::-1], dps0[::-1]):
        # Restrict the sums to the origin periods that have development period
        # dp1 populated (relies on sequential integer origins and devps).
        vals1 = tri.loc[min_origin:(max_origin - dp0), dp1].sum()
        vals0 = tri.loc[min_origin:(max_origin - dp0), dp0].sum()
        a2a.append((vals1 / vals0).item())

    return pd.Series(data=a2a[::-1], index=a2a_headers)
    

def get_latest(tri: pd.DataFrame) -> pd.Series:
    """
    Return the value at the latest development period by origin. 

    Parameters
    ----------
    tri : pd.DataFrame
        Cumulative or incremental triangle.

    Returns
    -------
    pd.Series
    """
    nbr_devps = tri.shape[1]
    latest = [tri.iat[ii, nbr_devps - ii - 1].item() for ii in range(nbr_devps)]
    return pd.Series(data=latest, index=tri.index, name="latest")


def square_tri(tri: pd.DataFrame, a2a: pd.Series) -> pd.DataFrame:
    """
    Project future losses for `tri` using `a2a`.

    Parameters
    ----------
    tri : pd.DataFrame
        Cumulative triangle.
    
    a2a: pd.Series
        Age-to-age factors.

    Returns
    -------
    pd.DataFrame
        Original triangle with projected future losses. 
    """
    nbr_devps = tri.shape[1]
    for r_idx in range(1, nbr_devps):
        for c_idx in range(nbr_devps - r_idx, nbr_devps):
            tri.iat[r_idx, c_idx] = tri.iat[r_idx, c_idx - 1]  * a2a.iat[c_idx - 1]
    return tri
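
As noted earlier, clarity was favored over performance and generality. For reference, a mask-based variant of get_a2a_factors that avoids the origin-year arithmetic might look like the following sketch (the name get_a2a_factors_masked is illustrative; for triangles satisfying the assumptions above it produces the same factors):

def get_a2a_factors_masked(tri: pd.DataFrame) -> pd.Series:
    """
    Illustrative alternative to get_a2a_factors: for each adjacent pair of
    development periods, sum only the origin periods in which the later
    development period is populated.
    """
    factors = {}
    for dp0, dp1 in zip(tri.columns[:-1], tri.columns[1:]):
        keep = tri[dp1].notna()
        factors[f"{dp0}-{dp1}"] = tri.loc[keep, dp1].sum() / tri.loc[keep, dp0].sum()
    return pd.Series(factors)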

Use to_cum to get a cumulative triangle:

ctri0 = to_cum(tri0)

print("Original cumulative triangle:")

ctri0
Original cumulative triangle:
           1        2        3        4        5        6        7        8        9       10
1981  5012.0   8269.0  10907.0  11805.0  13539.0  16181.0  18009.0  18608.0  18662.0  18834.0
1982   106.0   4285.0   5396.0  10666.0  13782.0  15599.0  15496.0  16169.0  16704.0      NaN
1983  3410.0   8992.0  13873.0  16141.0  18735.0  22214.0  22863.0  23466.0      NaN      NaN
1984  5655.0  11555.0  15766.0  21266.0  23425.0  26083.0  27067.0      NaN      NaN      NaN
1985  1092.0   9565.0  15836.0  22169.0  25955.0  26180.0      NaN      NaN      NaN      NaN
1986  1513.0   6445.0  11702.0  12935.0  15852.0      NaN      NaN      NaN      NaN      NaN
1987   557.0   4020.0  10946.0  12314.0      NaN      NaN      NaN      NaN      NaN      NaN
1988  1351.0   6947.0  13112.0      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1989  3133.0   5395.0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1990  2063.0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN

Calculate the all-year volume weighted age-to-age factors

$$f_{k} = \frac{\sum_{i} C_{i,k+1}}{\sum_{i} C_{i,k}}$$

for development period $k$, where the sums run over the origin periods for which development period $k+1$ is available and $C_{i,k}$ denotes the cumulative loss for origin period $i$ at development period $k$.

a2a = get_a2a_factors(ctri0)

print("All-year volume weighted age-to-age factors:")

a2a
All-year volume weighted age-to-age factors:
1-2     2.99936
2-3     1.62352
3-4     1.27089
4-5     1.17167
5-6     1.11338
6-7     1.04193
7-8     1.03326
8-9     1.01694
9-10    1.00922
dtype: float64

Although not required here, age-to-ultimate factors can be obtained as follows:

a2u = np.cumprod(a2a[::-1])[::-1]

print("Age-to-ultimate factors:")

a2u
Age-to-ultimate factors:
1-2     8.92023
2-3     2.97405
3-4     1.83185
4-5     1.44139
5-6     1.23020
6-7     1.10492
7-8     1.06045
8-9     1.02631
9-10    1.00922
dtype: float64

Calculate the cumulative fitted triangle by applying backwards recursion, beginning with the observed cumulative losses from the latest diagonal.

nbr_devps = ctri0.shape[1]

# Create empty triangle with same shape as tri0. 
ctri = pd.DataFrame(columns=tri0.columns, index=tri0.index, dtype=float)

for idx, origin in enumerate(ctri0.index):

    # Determine latest development period.
    latest_devp = nbr_devps - idx

    # Set latest diagonal of tri to same value as in tri0.
    ctri.at[origin, latest_devp] = ctri0.at[origin, latest_devp]

    # Use backward recursion to un-develop triangle using a2a. 
    for devp in range(latest_devp - 1, 0, -1):
        ctri.at[origin, devp] = (ctri.at[origin, devp + 1] / a2a.iloc[devp - 1]).astype(float)

print("Fitted cumulative triangle:")

ctri
Fitted cumulative triangle:
               1           2            3            4            5            6            7            8         9        10
1981  2111.37961  6332.78471  10281.42007  13066.53458  15309.72711  17045.61877  17760.42062  18351.19533   18662.0   18834.0
1982  1889.85559  5668.35472   9202.70287  11695.60570  13703.44452  15257.20801  15897.01351  16425.80467   16704.0       NaN
1983  2699.85868  8097.84449  13147.03479  16708.41026  19576.82046  21796.53602  22710.56587  23466.00000       NaN       NaN
1984  3217.75667  9651.20632  15668.95306  19913.48622  23332.12666  25977.63719  27067.00000          NaN       NaN       NaN
1985  3242.82263  9726.38811  15791.01241  20068.60999  23513.88125  26180.00000          NaN          NaN       NaN       NaN
1986  2186.16501  6557.09292  10645.58956  13529.35325  15852.00000          NaN          NaN          NaN       NaN       NaN
1987  1989.77995  5968.06372   9689.28724  12314.00000          NaN          NaN          NaN          NaN       NaN       NaN
1988  2692.66398  8076.26500  13112.00000          NaN          NaN          NaN          NaN          NaN       NaN       NaN
1989  1798.71787  5395.00000          NaN          NaN          NaN          NaN          NaN          NaN       NaN       NaN
1990  2063.00000         NaN          NaN          NaN          NaN          NaN          NaN          NaN       NaN       NaN

Calculate the unscaled Pearson residuals $r_{us}$, the degrees of freedom $DF$ and the scale parameter $\phi$.

The unscaled Pearson residuals are defined as

$$r_{us} = \frac{C - m}{\sqrt{|m|}}$$

where $C$ represents actual incremental losses and $m$ fitted incremental losses.

To get incremental representations of ctri0 and ctri, do:

# Actual incremental triangle.
tri0 = to_incr(ctri0)

print("Actual incremental triangle tri0:")

tri0
Actual incremental triangle tri0:
           1       2       3       4       5       6       7      8      9     10
1981  5012.0  3257.0  2638.0   898.0  1734.0  2642.0  1828.0  599.0   54.0  172.0
1982   106.0  4179.0  1111.0  5270.0  3116.0  1817.0  -103.0  673.0  535.0    NaN
1983  3410.0  5582.0  4881.0  2268.0  2594.0  3479.0   649.0  603.0    NaN    NaN
1984  5655.0  5900.0  4211.0  5500.0  2159.0  2658.0   984.0    NaN    NaN    NaN
1985  1092.0  8473.0  6271.0  6333.0  3786.0   225.0     NaN    NaN    NaN    NaN
1986  1513.0  4932.0  5257.0  1233.0  2917.0     NaN     NaN    NaN    NaN    NaN
1987   557.0  3463.0  6926.0  1368.0     NaN     NaN     NaN    NaN    NaN    NaN
1988  1351.0  5596.0  6165.0     NaN     NaN     NaN     NaN    NaN    NaN    NaN
1989  3133.0  2262.0     NaN     NaN     NaN     NaN     NaN    NaN    NaN    NaN
1990  2063.0     NaN     NaN     NaN     NaN     NaN     NaN    NaN    NaN    NaN
# Fitted incremental triangle.
tri = to_incr(ctri)

print("Fitted incremental triangle tri:")

tri
Fitted incremental triangle tri:
               1           2           3           4           5           6           7          8          9      10
1981  2111.37961  4221.40510  3948.63536  2785.11450  2243.19253  1735.89167   714.80185  590.77471  310.80467   172.0
1982  1889.85559  3778.49913  3534.34815  2492.90283  2007.83882  1553.76350   639.80549  528.79116  278.19533     NaN
1983  2699.85868  5397.98581  5049.19030  3561.37547  2868.41020  2219.71556   914.02985  755.43413        NaN     NaN
1984  3217.75667  6433.44965  6017.74674  4244.53316  3418.64044  2645.51053  1089.36281        NaN        NaN     NaN
1985  3242.82263  6483.56548  6064.62430  4277.59759  3445.27126  2666.11875         NaN        NaN        NaN     NaN
1986  2186.16501  4370.92792  4088.49664  2883.76369  2322.64675         NaN         NaN        NaN        NaN     NaN
1987  1989.77995  3978.28376  3721.22352  2624.71276         NaN         NaN         NaN        NaN        NaN     NaN
1988  2692.66398  5383.60102  5035.73500         NaN         NaN         NaN         NaN        NaN        NaN     NaN
1989  1798.71787  3596.28213         NaN         NaN         NaN         NaN         NaN        NaN        NaN     NaN
1990  2063.00000         NaN         NaN         NaN         NaN         NaN         NaN        NaN        NaN     NaN

The unscaled Pearson residuals are then calculated element-wise:

r_us = (tri0 - tri) / tri.abs().map(np.sqrt)

print("Unscaled Pearson residuals:")

r_us
Unscaled Pearson residuals:
              1          2          3          4          5          6          7         8          9    10
1981   63.12592  -14.84332  -20.85731  -35.75829  -10.75100   21.74797   41.63702   0.33841  -14.56663   0.0
1982  -41.03414    6.51544  -40.76253   55.62095   24.73082    6.67811  -29.36643   6.27119   15.39671   NaN
1983   13.66703    2.50458   -2.36696  -21.67284   -5.12365   26.72854   -8.76627  -5.54605        NaN   NaN
1984   42.96574   -6.65076  -23.29058   19.27038  -21.54368    0.24282   -3.19228       NaN        NaN   NaN
1985  -37.76965   24.70715    2.65007   31.42656    5.80493  -47.27692        NaN       NaN        NaN   NaN
1986  -14.39727    8.48656   18.27461  -30.74009   12.33255        NaN        NaN       NaN        NaN   NaN
1987  -32.12011   -8.16956   52.53574  -24.52986        NaN        NaN        NaN       NaN        NaN   NaN
1988  -25.85548    2.89478   15.91345        NaN        NaN        NaN        NaN       NaN        NaN   NaN
1989   31.46054  -22.24953        NaN        NaN        NaN        NaN        NaN       NaN        NaN   NaN
1990    0.00000        NaN        NaN        NaN        NaN        NaN        NaN       NaN        NaN   NaN

The degrees of freedom are $DF = n - p$, where $n$ is the number of populated cells in the original triangle and $p$ the number of parameters in the chain ladder model (10 for origin period and 9 for development period):

n = tri0.count().sum().item()
p = tri0.index.shape[0] + tri0.columns.shape[0] - 1
DF = n - p

print(f"Degrees of freedom: {DF}.")
Degrees of freedom: 36.

The scale parameter $\phi$ is the sum of the squared unscaled Pearson residuals divided by the degrees of freedom:

phi = ((r_us**2).sum().sum() / DF).item()

print(f"Scale parameter: {phi:.3f}.")
Scale parameter: 983.635.

Calculate the adjusted Pearson residuals $r_{adj}$, defined as:

$$r_{adj} = \sqrt{\frac{n}{DF}} \times r_{us}$$

r_adj = np.sqrt(n / DF) * r_us 

print("Adjusted Pearson residuals:")

r_adj
Adjusted Pearson residuals:
              1          2          3          4          5          6          7         8          9    10
1981   78.02573  -18.34683  -25.78033  -44.19843  -13.28859   26.88122   51.46473   0.41828  -18.00484   0.0
1982  -50.71956    8.05330  -50.38384   68.74933   30.56811    8.25437  -36.29788   7.75140   19.03085   NaN
1983   16.89291    3.09575   -2.92564  -26.78834   -6.33300   33.03735  -10.83539  -6.85510        NaN   NaN
1984   53.10708   -8.22056  -28.78793   23.81883  -26.62870    0.30014   -3.94576       NaN        NaN   NaN
1985  -46.68454   30.53886    3.27557   38.84427    7.17509  -58.43584        NaN       NaN        NaN   NaN
1986  -17.79550   10.48967   22.58802  -37.99576   15.24345        NaN        NaN       NaN        NaN   NaN
1987  -39.70151  -10.09784   64.93591  -30.31972        NaN        NaN        NaN       NaN        NaN   NaN
1988  -31.95823    3.57805   19.66955        NaN        NaN        NaN        NaN       NaN        NaN   NaN
1989   38.88627  -27.50115        NaN        NaN        NaN        NaN        NaN       NaN        NaN   NaN
1990    0.00000        NaN        NaN        NaN        NaN        NaN        NaN       NaN        NaN   NaN

(From this point each subsequent step is repeated up to the desired number of bootstrap samples.)

Generate a sample from the adjusted Pearson residuals with replacement:

# Set random seed for reproducibility.
rng = np.random.default_rng(seed=516)

# Represent adjusted residuals as Numpy array with nans and 0s removed.
r = r_adj.iloc[:-1, :-1].astype(float).values.flatten()
r = r[np.logical_and(~np.isnan(r), r != 0)]

# Sample tri0.shape[0] * tri0.shape[1] values at each iteration, but only
# keep values in upper left portion of triangle. Use mask to determine 
# which values to retain.
mask = ~np.isnan(tri0)

# Sample with replacement from adjusted residuals. 
s_r = rng.choice(r, size=mask.shape, replace=True)

# Replace 0s with nans.
s_r = (mask * s_r).replace(0, np.nan)

print("Sample with replacement from adjusted Pearson residuals:")

s_r
Sample with replacement from adjusted Pearson residuals:
              1          2          3          4          5          6          7          8          9        10
1981  -50.38384  -13.28859    7.17509   30.53886   26.88122  -18.34683  -58.43584   33.03735   10.48967   7.17509
1982  -10.09784    8.25437  -26.78834    0.41828   78.02573  -58.43584   38.84427  -30.31972   19.66955       NaN
1983  -27.50115   38.84427  -58.43584   -6.33300    7.17509  -27.50115    0.41828  -50.71956        NaN       NaN
1984  -44.19843    8.05330  -37.99576  -27.50115   38.84427  -25.78033  -30.31972        NaN        NaN       NaN
1985   19.03085   -3.94576  -26.78834   -2.92564   38.88627    7.75140        NaN        NaN        NaN       NaN
1986   30.56811  -27.50115   26.88122    8.25437   38.84427        NaN        NaN        NaN        NaN       NaN
1987  -39.70151  -17.79550  -18.34683  -10.83539        NaN        NaN        NaN        NaN        NaN       NaN
1988   22.58802  -18.00484    3.09575        NaN        NaN        NaN        NaN        NaN        NaN       NaN
1989  -25.78033  -26.78834        NaN        NaN        NaN        NaN        NaN        NaN        NaN       NaN
1990    0.41828        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN       NaN

Using the sampled adjusted Pearson residuals $r^{*}$ and the fitted incremental triangle $m$, construct the triangle of sampled incremental losses $C^{*}$:

$$C^{*} = m + r^{*}\sqrt{m}$$

where $r^{*}$ represents a sample with replacement from the adjusted Pearson residuals and $m$ the fitted incremental triangle:

tri_ii = tri + s_r * tri.map(np.sqrt)

print("Triangle of sampled incremental loss tri_ii:")

tri_ii
Triangle of sampled incremental loss tri_ii:
                1           2           3           4           5           6           7           8          9          10
1981   -203.74517  3358.01441  4399.50467  4396.77782  3516.35018   971.48866  -847.52570  1393.77594  495.73396  266.10038
1982   1450.87736  4285.89089  1941.77093  2513.78729  5504.08697  -749.64897  1622.34716  -168.42480  606.26754        NaN
1983   1270.89430  8251.91283   896.87689  3183.43918  3252.69020   924.03021   926.67577  -638.60120        NaN        NaN
1984    710.58877  7079.39510  3070.25824  2452.83086  5689.83169  1319.51149    88.64530         NaN        NaN        NaN
1985   4326.54913  6165.85019  3978.46348  4086.25126  5727.75635  3066.35794         NaN         NaN        NaN        NaN
1986   3615.42117  2552.74443  5807.31796  3327.02883  4194.70163         NaN         NaN         NaN        NaN        NaN
1987    218.81653  2855.85695  2602.03318  2069.59441         NaN         NaN         NaN         NaN        NaN        NaN
1988   3864.77654  4062.53149  5255.41826         NaN         NaN         NaN         NaN         NaN        NaN        NaN
1989    705.34074  1989.81179         NaN         NaN         NaN         NaN         NaN         NaN        NaN        NaN
1990   2081.99854         NaN         NaN         NaN         NaN         NaN         NaN         NaN        NaN        NaN

Create a cumulative triangle, and project future losses using the chain ladder method:

# Create cumulative triangle from sampled incremental losses.
ctri_ii = to_cum(tri_ii)

# Get age-to-age factors for sampled cumulative triangle.
a2a_ii = get_a2a_factors(ctri_ii)

# Square ctri_ii, populating the lower-right side using a2a_ii.
ctri_ii_sqrd = square_tri(ctri_ii, a2a_ii)

print("Completed sampled triangle ctri_ii_sqrd:")

ctri_ii_sqrd
Completed sampled triangle ctri_ii_sqrd:
                1            2            3            4            5            6            7            8            9           10
1981   -203.74517   3154.26925   7553.77392  11950.55173  15466.90192  16438.39058  15590.86488  16984.64083  17480.37479  17746.47517
1982   1450.87736   5736.76825   7678.53919  10192.32648  15696.41345  14946.76448  16569.11164  16400.68684  17006.95438  17265.84797
1983   1270.89430   9522.80713  10419.68402  13603.12320  16855.81340  17779.84361  18706.51938  18067.91819  18664.31410  18948.43736
1984    710.58877   7789.98388  10860.24211  13313.07297  19002.90466  20322.41615  20411.06145  20646.50502  21328.01636  21652.68865
1985   4326.54913  10492.39932  14470.86280  18557.11406  24284.87042  27351.22836  28055.85376  28379.48074  29316.24646  29762.52204
1986   3615.42117   6168.16560  11975.48356  15302.51239  19497.21402  20678.43364  21211.15375  21455.82646  22164.05234  22501.45144
1987    218.81653   3074.67348   5676.70666   7746.30107  10351.36885  10978.49639  11261.32564  11391.22612  11767.23406  11946.36440
1988   3864.77654   7927.30803  13182.72628  17413.95519  23270.23849  24680.04309  25315.85312  25607.87393  26453.15290  26855.84416
1989    705.34074   2695.15253   4093.68628   5407.62721   7226.20298   7663.99541   7861.43612   7952.11854   8214.60650   8339.65588
1990   2081.99854   7378.73207  11207.60842  14804.88834  19783.74693  20982.32586  21522.87488  21771.14332  22489.77736  22832.13491

So far we’ve accounted for parameter variance, but not process variance. In order to obtain the full prediction error, we need to incorporate process variance into our estimates. This is accomplished by simulating incremental projected losses from a gamma distribution. For each cell in the lower right of the completed triangle, we randomly sample from a gamma distribution with mean equal to the projected incremental loss in that cell, and variance equal to the value in that cell times $\phi$. For example, consider the following squared incremental triangle:

      1  2  3  4  5  6  7  8  9  10
1991  84 27  6  6  3  0  2 -1 -0  2
1992 109 33  7  2  4  4  1 -1  1  2
1993  86 28  8  4  3  2 -0  1  0  2
1994 113 32  1  4  3  2 -1 -0  0  2
1995  86 26  6  3  2  2  0 -0  0  2
1996 107 39  7  4  4  2  1 -0  0  2
1997  72 26  2  3  2  2  0 -0  0  1
1998  77 21  3  3  2  2  0 -0  0  1
1999  74 28  4  3  2  2  0 -0  0  1
2000  54 17  3  2  2  1  0 -0  0  1

Values to the right of the latest diagonal represent projected future losses. For the loss at origin = 2000 and development period = 2, the projected incremental loss is 17. We would therefore sample from a gamma distribution parameterized as follows:

from numpy.random import gamma

# Scale parameter for this example triangle (kept separate from the phi
# computed above for the RAA data, which is used again below).
phi_ex = .798

# Value at intersection of origin=2000 and development period = 2.
mu = 17

# Determine shape and scale from mean and variance.
shape = mu**2 / (phi_ex * mu)
scale = (phi_ex * mu) / mu

# Generate sample from gamma distribution.
rng.gamma(shape=shape, scale=scale, size=1)
# array([19.29149])

We take advantage of the fact that for a gamma distribution with mean $\mu$ and variance $\phi\mu$, the shape parameter is $\alpha = \mu^{2}/(\phi\mu) = \mu/\phi$ and the scale parameter is $\theta = \phi\mu/\mu = \phi$. In essence, we are simulating future incremental losses from a series of gamma distributions, each parameterized using the chain ladder-derived future incremental losses. To handle cases in which a projected incremental loss might be negative, we take the absolute value of the projected loss when determining $\alpha_{ij}$ and $\theta_{ij}$ for origin period $i$ and development period $j$.
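
As a quick check of this parameterization (a sketch, not part of the original analysis, using a throwaway generator so the main random stream is left untouched), a large gamma sample with shape $17 / 0.798$ and scale $0.798$ should have a mean close to $17$ and a variance close to $0.798 \times 17 \approx 13.6$:

# Illustrative check: the large-sample mean and variance of the gamma draws
# should approximate mu and phi * mu respectively.
check_rng = np.random.default_rng(0)
check = check_rng.gamma(shape=17 / .798, scale=.798, size=100_000)
print(check.mean(), check.var())  # roughly 17 and 13.6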

"""
Incorporation of process variance. 
"""

# pd.set_option('display.float_format', '{:.2f}'.format)

from numpy.random import gamma


# Get sampled squared incremental triangle.
tri_ii_sqrd = to_incr(ctri_ii_sqrd)

for r_idx in range(1, nbr_devps):

    for c_idx in range(nbr_devps - r_idx, nbr_devps):
        
        # Get mean and variance from incremental loss value.
        m = np.abs(tri_ii_sqrd.iat[r_idx, c_idx].item())
        v = m * phi

        # Determine shape and scale parameters. 
        shape = m**2 / v
        scale = v / m

        # Update value at [r_idx, c_idx] with sample from gamma distribution.
        tri_ii_sqrd.iat[r_idx, c_idx] = rng.gamma(shape=shape, scale=scale, size=1).item()


print("Sampled incremental triangle w/ process variance:")

tri_ii_sqrd
Sampled incremental triangle w/ process variance:
                1           2           3           4           5           6           7           8          9          10
1981   -203.74517  3358.01441  4399.50467  4396.77782  3516.35018   971.48866  -847.52570  1393.77594  495.73396  266.10038
1982   1450.87736  4285.89089  1941.77093  2513.78729  5504.08697  -749.64897  1622.34716  -168.42480  606.26754    0.00564
1983   1270.89430  8251.91283   896.87689  3183.43918  3252.69020   924.03021   926.67577  -638.60120  234.75693    1.54909
1984    710.58877  7079.39510  3070.25824  2452.83086  5689.83169  1319.51149    88.64530     0.81938  138.50517  140.97113
1985   4326.54913  6165.85019  3978.46348  4086.25126  5727.75635  3066.35794   375.74168    20.52462  624.28700  115.89744
1986   3615.42117  2552.74443  5807.31796  3327.02883  4194.70163   218.81315    10.16396    14.27299  306.61617  371.23650
1987    218.81653  2855.85695  2602.03318  2069.59441  5116.03482   160.27811   418.30086     0.00003   14.95452    1.75255
1988   3864.77654  4062.53149  5255.41826  1226.24633  3936.24691  2824.74743   827.32061     4.35813   84.72082  387.91986
1989    705.34074  1989.81179  1215.04136   619.78260  3797.39110   796.30572     8.32738     2.07908    0.25043   83.89140
1990   2081.99854  3260.84269  1892.69935  2777.31515  1124.31564   720.69074    10.07638     7.82743  299.71757    0.78279

From this point, we proceed exactly as if performing a standard chain ladder analysis: Cumulate incremental losses, then compute the total needed reserve as the ultimate projected value minus the latest cumulative loss amount by origin period:

ctri_ii_sqrd = to_cum(tri_ii_sqrd)

latest = get_latest(ctri_ii_sqrd)

ibnr = ctri_ii_sqrd.iloc[:, -1] - latest

print("IBNR for sampled triangle:")

ibnr
IBNR for sampled triangle:
1981        0.00000
1982        0.00564
1983      236.30602
1984      280.29567
1985     1136.45073
1986      921.10277
1987     5711.32089
1988     9291.56009
1989     6523.06907
1990    10094.26775
dtype: float64

The preceding steps are repeated for the desired number of bootstrap samples, resulting in the predictive distribution of total needed reserve by origin period and in aggregate.

Bringing it All Together

The steps outlined above are combined in the next cell to run 1000 bootstrap samples, generating the predictive distribution of reserves. We also present code to visualize the predictive distribution by origin period and in aggregate.

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import gamma
import pandas as pd


# Set random seed for reproducibility.
rng = np.random.default_rng(seed=516)

# Number of bootstrap samples.
nbr_samples = 1000

# Load tabular incremental losses. Convert to incremental triangle. 
dfraa = pd.read_csv("https://gist.githubusercontent.com/jtrive84/976c80786a6e97cce7483e306562f85b/raw/06a5c8b1f823fbe2b6da15f90a672517fa5b4571/RAA.csv")
dfraa = dfraa.sort_values(by=["ORIGIN", "DEV"])
tri0 = dfraa.pivot(index="ORIGIN", columns="DEV").rename_axis(None)
tri0.columns = tri0.columns.droplevel(0)
nbr_devps = tri0.shape[1]
mask = ~np.isnan(tri0)

# Create cumulative triangle from original losses.
ctri0 = to_cum(tri0)

# All-year volume-weighted age-to-age factors.
a2a = get_a2a_factors(ctri0)

# Cumulative fitted triangle via backwards recursion.
ctri = pd.DataFrame(columns=tri0.columns, index=tri0.index, dtype=float)

for idx, origin in enumerate(ctri0.index):

    # Determine latest development period.
    latest_devp = nbr_devps - idx

    # Set latest diagonal of tri to same value as in tri0.
    ctri.at[origin, latest_devp] = ctri0.at[origin, latest_devp]

    # Use backward recursion to un-develop triangle using a2a. 
    for devp in range(latest_devp - 1, 0, -1):
        ctri.at[origin, devp] = ctri.at[origin, devp + 1] / a2a.iloc[devp - 1]

# Incremental fitted triangle.
tri = to_incr(ctri)

# Unscaled Pearson residuals.
r_us = (tri0 - tri) / tri.abs().map(np.sqrt)

# Degrees of freedom.
n = tri0.count().sum().item()
p = tri0.index.shape[0] + tri0.columns.shape[0] - 1
DF = n - p

# Scale parameter.
phi = ((r_us**2).sum().sum() / DF).item()

# Adjusted Pearson residuals.
r_adj = np.sqrt(n / DF) * r_us 

# Create sampling distribution from adjusted Pearson residuals. Remove
# nans and 0s. 
r = r_adj.iloc[:-1, :-1].astype(float).values.flatten()
r = r[np.logical_and(~np.isnan(r), r != 0)]


# Sample tri0.shape[0] * tri0.shape[1] values at each iteration, but only
# keep values in upper left portion of triangle. Use mask to determine 
# which values to retain.
sqrd_ctris = []

for ii in range(nbr_samples):

    # Sample with replacement from adjusted residuals. 
    s_r = rng.choice(r, size=mask.shape, replace=True)

    # Replace 0s with nans.
    s_r = (mask * s_r).replace(0, np.nan)

    # Sampled incremental triangle.
    tri_ii = tri + s_r * tri.map(np.sqrt)

    # Sampled cumulative triangle.
    ctri_ii = to_cum(tri_ii)

    # Age-to-age factors for sampled cumulative triangle.
    a2a_ii = get_a2a_factors(ctri_ii)

    # Sampled squared cumulative triangle.
    ctri_ii_sqrd = square_tri(ctri_ii, a2a_ii)

    # Sampled squared incremental triangle.
    tri_ii_sqrd = to_incr(ctri_ii_sqrd)

    # Incorporate process variance.
    for r_idx in range(1, nbr_devps):

        for c_idx in range(nbr_devps - r_idx, nbr_devps):
            
            # Get mean and variance from incremental loss value.
            m = np.abs(tri_ii_sqrd.iat[r_idx, c_idx].item())
            v = m * phi

            # Determine shape and scale parameters. 
            shape = m**2 / v
            scale = v / m

            # Update value at [r_idx, c_idx] with sample from gamma distribution.
            tri_ii_sqrd.iat[r_idx, c_idx] = rng.gamma(shape=shape, scale=scale, size=1).item()

    ctri_ii_sqrd2 = to_cum(tri_ii_sqrd)

    # Keep Sampled squared triangle.
    sqrd_ctris.append(ctri_ii_sqrd2)

Obtain the predictive distribution of reserves and ultimates from sqrd_ctris:

ultimates, reserves = [], []

for ii, ctri in enumerate(sqrd_ctris):

    ult = (
        ctri.iloc[:, -1]
        .to_frame()
        .reset_index(drop=False)
        .rename({"index": "origin", 10: "ult"}, axis=1)
    )

    ult["n"] = ii
    ultimates.append(ult)

    latest = get_latest(ctri)
    ibnr = (
        (ctri.iloc[:, -1] - latest).astype(float)
        .to_frame()
        .reset_index(drop=False)
        .rename({"index": "origin", 0: "ibnr"}, axis=1)
    )
    ibnr["n"] = ii
    reserves.append(ibnr)


dfults = pd.concat(ultimates).reset_index(drop=True)
dfibnr = pd.concat(reserves).reset_index(drop=True)

Using dfults and dfibnr, we create a summary of mean ultimate, mean IBNR, standard error of IBNR as well as 75th and 95th percentiles of the predictive distribution of reserves:

from functools import reduce

# Latest cumulative loss amount by origin.
latest = (
    get_latest(ctri0)
    .to_frame()
    .reset_index(drop=False)
    .rename({"index": "origin"}, axis=1)
)

# Mean ultimate by origin.
ult_mean = (
    dfults.groupby("origin")["ult"].mean()
    .to_frame()
    .reset_index(drop=False)
    .rename({"ult": "ult_mean"}, axis=1)
)

ibnr_mean =  (
    dfibnr.groupby("origin")["ibnr"].mean()
    .to_frame()
    .reset_index(drop=False)
    .rename({"ibnr": "ibnr_mean"}, axis=1)
)

# Standard error of reserve distribution by origin. 
ibnr_se = (
    dfibnr.groupby("origin")["ibnr"].std(ddof=1)
    .to_frame()
    .reset_index(drop=False)
    .rename({"ibnr": "ibnr_se"}, axis=1)
)

# 75th percentile of reserve distribution by origin. 
ibnr_75 = (
    dfibnr.groupby("origin")["ibnr"].quantile(.75)
    .to_frame()
    .reset_index(drop=False)
    .rename({"ibnr": "ibnr_75th"}, axis=1)
)

# 95th percentile of reserve distribution by origin. 
ibnr_95 = (
    dfibnr.groupby("origin")["ibnr"].quantile(.95)
    .to_frame()
    .reset_index(drop=False)
    .rename({"ibnr": "ibnr_95th"}, axis=1)
)


# Combine into a single DataFrame.
bcl_summary = reduce(
    lambda df1, df2: df1.merge(df2, on="origin", how="left"),
    (latest, ult_mean, ibnr_mean, ibnr_se, ibnr_75, ibnr_95)
)

# Set ult_mean for earliest origin period to latest.
bcl_summary.at[0, "ult_mean"] = bcl_summary.at[0, "latest"]

print("Boostrap chain ladder summary by origin:")


bcl_summary.round(0)
Bootstrap chain ladder summary by origin:
   origin   latest  ult_mean  ibnr_mean  ibnr_se  ibnr_75th  ibnr_95th
0    1981  18834.0   18834.0        0.0      0.0        0.0        0.0
1    1982  16704.0   16656.0      317.0    625.0      344.0     1511.0
2    1983  23466.0   24708.0     1032.0   1182.0     1489.0     3412.0
3    1984  27067.0   29353.0     2297.0   1881.0     3089.0     6149.0
4    1985  26180.0   29833.0     3414.0   2300.0     4627.0     7669.0
5    1986  15852.0   20111.0     4056.0   2354.0     5471.0     8261.0
6    1987  12314.0   18474.0     5997.0   3381.0     7956.0    11957.0
7    1988  13112.0   24511.0    11401.0   4810.0    14397.0    20429.0
8    1989   5395.0   16590.0    11195.0   6314.0    14628.0    22886.0
9    1990   2063.0   19705.0    17697.0  13470.0    25418.0    43102.0

While results by origin can be useful, typically actuaries are more interested in the aggregate view. To get aggregate results, we first group by simulation number, then proceed as in the prior cell:

# Aggregate bootstrap chain ladder results.

agg_ults = dfults.groupby("n")["ult"].sum()
agg_ibnr = dfibnr.groupby("n")["ibnr"].sum()

dsumm = {
    "latest": [latest["latest"].sum().item()],
    "ult_mean": [agg_ults.mean().item()],
    "ibnr_mean": [agg_ibnr.mean().item()],
    "ibnr_se": [agg_ibnr.std(ddof=1).item()],
    "ibnr_75th": [agg_ibnr.quantile(.75).item()],
    "ibnr_95th": [agg_ibnr.quantile(.95).item()]
}

bcl_summary_total = pd.DataFrame().from_dict(dsumm, orient="columns")
bcl_summary_total.index = ["total"]

print("Boostrap chain ladder summary in total:")

bcl_summary_total.round(0)
Bootstrap chain ladder summary in total:
         latest  ult_mean  ibnr_mean  ibnr_se  ibnr_75th  ibnr_95th
total  160987.0  218945.0    57408.0  19025.0    69557.0    91763.0

Visualizing Bootstrap Chain Ladder Results

We can visualize actuals and predictions together by origin out to ultimate with 90% prediction intervals for each forecast period. Starting with sqrd_ctris, we transform the data to make it easier for plotting:


dflist = []

for tri in sqrd_ctris:

    dftri = (
        tri
        .reset_index(drop=False)
        .rename_axis(None, axis=1)
        .rename({"index": "origin"}, axis=1)
        .melt(id_vars="origin", var_name="dev", value_name="bcl_value")
    )

    dflist.append(dftri)


df = (
    pd.concat(dflist)
    .sort_values(["origin", "dev"])
    .reset_index(drop=True)
)


# Compute mean, 5th and 95th percentile of prediction interval for each forecast period.
df = (
    df
    .groupby(["origin", "dev"], as_index=False)["bcl_value"]
    .agg({
        "bcl_mean": lambda v: v.mean(), 
        "bcl_95th": lambda v: v.quantile(.95),
        "bcl_5th": lambda v: v.quantile(.05)
    })
)

# Attach actual values from original cumulative triangle.
dfctri0 = (
    ctri0
    .reset_index(drop=False)
    .rename_axis(None, axis=1)
    .rename({"index": "origin"}, axis=1)
    .melt(id_vars="origin", var_name="dev", value_name="actual_value")
)

df = df.merge(dfctri0, on=["origin", "dev"], how="left")

# If actual_value is nan, then dev is a prediction for that origin. Otherwise
# it is an actual value. 
df["actual_devp_ind"] = df["actual_value"].map(lambda v: 1 if not np.isnan(v) else 0)
df["value"] = df.apply(lambda r: r.bcl_mean if r.actual_devp_ind==0 else r.actual_value, axis=1)

df.tail(10)
    origin  dev     bcl_mean     bcl_95th     bcl_5th  actual_value  actual_devp_ind        value
90    1990    1   2007.64245   4475.13715  -225.44698        2063.0                1   2063.00000
91    1990    2   6411.97786  16077.09339    55.49594           NaN                0   6411.97786
92    1990    3  10426.91522  25457.57282   480.91586           NaN                0  10426.91522
93    1990    4  13338.07408  32439.26914   760.56725           NaN                0  13338.07408
94    1990    5  15634.70194  37986.93978   916.68155           NaN                0  15634.70194
95    1990    6  17448.74348  41542.91512  1175.72687           NaN                0  17448.74348
96    1990    7  18163.22153  43856.27381  1217.23798           NaN                0  18163.22153
97    1990    8  18840.41438  45536.31744  1285.16070           NaN                0  18840.41438
98    1990    9  19316.41732  46515.62726  1348.81827           NaN                0  19316.41732
99    1990   10  19705.03950  47359.46502  1398.32997           NaN                0  19705.03950

Actuals with forecasts by origin year with 90% prediction intervals:

import matplotlib as mpl
from matplotlib.ticker import MaxNLocator


fill_color = "#FFC04C"


# Assume 9 origin periods (no distribution of fully-developed oldest origin period)
indices = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
origin_periods = tri0.index[1:].tolist()

fig, ax = plt.subplots(3, 3, figsize=(10, 8), tight_layout=True, sharex=True) 

for (ii, jj), origin in zip(indices, origin_periods):

    dforigin = df[df.origin==origin]

    ax[ii, jj].set_title(f"{origin}", fontsize=8)

    # Get last actual development period for origin.
    last_actual_devp = dforigin[dforigin.actual_devp_ind==1].dev.max()

    # Actual values.
    act_dev = dforigin[dforigin.actual_devp_ind==1].dev.tolist()
    act_val = dforigin[dforigin.actual_devp_ind==1].value.tolist()

    # Predicted values.
    pred_dev = [last_actual_devp] + dforigin[dforigin.actual_devp_ind==0].dev.tolist()
    pred_val = [act_val[-1]] + dforigin[dforigin.actual_devp_ind==0].value.tolist()

    # 5th and 95th percentiles.
    pred_5th = [act_val[-1]] +  dforigin[dforigin.actual_devp_ind==0].bcl_5th.tolist()
    pred_95th = [act_val[-1]] + dforigin[dforigin.actual_devp_ind==0].bcl_95th.tolist()

    ax[ii, jj].plot(
        pred_dev, pred_val, "o", markersize=7, color="#1d2951", markerfacecolor="#FFFFFF", 
        markeredgecolor="#1d2951", markeredgewidth=.35, linestyle="--", linewidth=1., label="predicted"
    )

    ax[ii, jj].plot(
        act_dev, act_val, "o", markersize=7, color="#1d2951", markerfacecolor="#1d2951", 
         markeredgecolor="#1d2951", markeredgewidth=.35, linestyle="-", linewidth=1., label="actual"
    )

    ax[ii, jj].plot(
        pred_dev, pred_95th, color="#000000", linestyle=":",  # color="#FFFFB2",
        linewidth=.75, label="95th percentile"
    )

    ax[ii, jj].plot(
        pred_dev, pred_5th, color="#000000", linestyle="-.",  # color="#FFFFB2",
        linewidth=.75, label="5th percentile"
    )

    ax[ii, jj].fill_between(pred_dev, pred_5th, pred_95th, color=fill_color, alpha=.50)
    ax[ii, jj].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
    ax[ii, jj].get_yaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    ax[ii, jj].tick_params(axis="x", which="major", direction='in', labelsize=7)
    ax[ii, jj].tick_params(axis="y", which="major", direction='in', labelsize=7)
    ax[ii, jj].xaxis.set_ticks_position("none")
    ax[ii, jj].yaxis.set_ticks_position("none")
    ax[ii, jj].grid(True)   
    ax[ii, jj].set_axisbelow(True) 
    ax[ii, jj].legend(loc="upper left", fancybox=True, framealpha=1, fontsize="x-small")

plt.suptitle("Bootstrap chain ladder forecasts with 90% prediction interval", fontsize=10, weight="bold")

plt.show()

As expected, the prediction intervals grow wider for origin periods with fewer development periods of actual data to account for the greater uncertainty in ultimate projections.

Second, an exhibit with a separate histogram per facet can be used to visualize the distribution of IBNR generated by the bootstrap chain ladder:

import matplotlib as mpl
import matplotlib.pyplot as plt


# Color for each histogram.
hist_color = "#5473ff"

# Assume 9 origin periods (no distribution of fully-developed oldest origin period)
indices = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
origin_periods = tri0.index[1:].tolist()

fig, ax = plt.subplots(3, 3, figsize=(10, 7), tight_layout=True) 

for (ii, jj), origin in zip(indices, origin_periods):
    ax[ii, jj].set_title(str(origin), fontsize=8, weight="normal")
    ax[ii, jj].hist(
        dfibnr[dfibnr.origin==origin].ibnr, 20, density=True, alpha=1, 
        color=hist_color, edgecolor="#FFFFFF", linewidth=1.0
        )
    
    # ax[ii, jj].yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
    ax[ii, jj].set_yticklabels([])
    ax[ii, jj].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
    ax[ii, jj].tick_params(axis="x", which="major", direction='in', labelsize=6)
    ax[ii, jj].tick_params(axis="x", which="minor", direction='in', labelsize=6)
    ax[ii, jj].tick_params(axis="y", which="major", direction='in', labelsize=6)
    ax[ii, jj].tick_params(axis="y", which="minor", direction='in', labelsize=6)
    ax[ii, jj].xaxis.set_ticks_position("none")
    ax[ii, jj].yaxis.set_ticks_position("none")
    ax[ii, jj].grid(True)   
    ax[ii, jj].set_axisbelow(True) 

plt.suptitle("Boostrap chain ladder: IBNR by origin", fontsize=9, weight="bold")
plt.show()

Finally, we can create a similar exhibit for the aggregate distribution of IBNR, with vertical lines added at the 50th, 75th, 95th and 99th percentile of total needed reserve:

hist_color = "#5473ff"

dfibnr_total = dfibnr.groupby("n", as_index=False)["ibnr"].sum()
ibnr_total = dfibnr_total["ibnr"].values
ibnr_50 = np.quantile(ibnr_total, .50).item()
ibnr_75 = np.quantile(ibnr_total, .75).item()
ibnr_95 = np.quantile(ibnr_total, .95).item()
ibnr_99 = np.quantile(ibnr_total, .99).item()

fig, ax = plt.subplots(1, 1, figsize=(7.5, 5.25), tight_layout=True) 

ax.set_title("Bootstrap chain ladder: total IBNR", fontsize=10, weight="bold")

ax.hist(
    ibnr_total, 27, density=True, alpha=1, color=hist_color, 
    edgecolor="#FFFFFF", linewidth=1.0
    )

# 50th percentile.
ax.axvline(ibnr_50, color="#000000", linewidth=1.25, linestyle="--")
ax.annotate(
    r"$p_{{50}} = {:,.0f}$".format(ibnr_50), xycoords="data", xy=(ibnr_50, 2e-5),
    fontsize=11, rotation=90, weight="normal", color="#000000", xytext=(10, 0), 
    textcoords="offset pixels"
)

# 75th percentile.
ax.axvline(ibnr_75, color="#000000", linewidth=1.25, linestyle="--")
ax.annotate(
    r"$p_{{75}} = {:,.0f}$".format(ibnr_75), xycoords="data", xy=(ibnr_75, 2e-5),
    fontsize=11, rotation=90, weight="normal", color="#000000", xytext=(10, 0), 
    textcoords="offset pixels"
)

# 95th percentile.
ax.axvline(ibnr_95, color="#000000", linewidth=1.25, linestyle="--")
ax.annotate(
    r"$p_{{95}} = {:,.0f}$".format(ibnr_95), xycoords="data", xy=(ibnr_95, 2e-5),
    fontsize=11, rotation=90, weight="normal", color="#000000", xytext=(10, 0), 
    textcoords="offset pixels"
)

# 99th percentile.
ax.axvline(ibnr_99, color="#000000", linewidth=1.25, linestyle="--")
ax.annotate(
    r"$p_{{99}} = {:,.0f}$".format(ibnr_99), xycoords="data", xy=(ibnr_99, 2e-5),
    fontsize=11, rotation=90, weight="normal", color="#000000", xytext=(10, 0), 
    textcoords="offset pixels"
)

# ax[ii, jj].yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
ax.set_yticklabels([])
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
ax.tick_params(axis="x", which="major", direction='in', labelsize=8)
# ax.tick_params(axis="x", which="minor", direction='in', labelsize=8)
ax.tick_params(axis="y", which="major", direction='in', labelsize=8)
# ax.tick_params(axis="y", which="minor", direction='in', labelsize=8)
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.grid(True)   
ax.set_axisbelow(True) 
plt.show()

Existing third-party Python libraries such as trikit expose a number of models that can be used to estimate outstanding claim liabilities, but it can be helpful to see in detail how these estimates are obtained. Other loss reserving techniques will be explored in future posts.
