Bessel’s Correction

The Pleasure of Finding Things Out: A blog by James Triveri

1 year ago

This article was first published on The Pleasure of Finding Things Out: A blog by James Triveri, and kindly contributed to python-bloggers.

Bessel’s correction is the use of $n - 1$ instead of $n$ in the sample variance formula, where $n$ is the number of observations in a sample. This method corrects the bias in the estimation of the population variance.

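Recall that bias is defined as:

$$
B(\hat{\theta}) = E[\hat{\theta}] - \theta,
$$

where $\theta$ represents the actual parameter value, and $\hat{\theta}$ is an estimator of the parameter $\theta$. A desirable property of an estimator is that its expected value equals the parameter being estimated, or $E[\hat{\theta}] = \theta$. When this occurs, the estimator is said to be unbiased. Let $\sigma^{2}$ represent the population variance, given by

$$
\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \mu)^{2}.
$$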

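To show that $\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}$ is a biased estimator for $\sigma^{2}$, let $X_{1}, X_{2}, \ldots, X_{n}$ be a random sample with $E[X_{i}] = \mu$ and $\mathrm{Var}[X_{i}] = \sigma^{2}$. First, note that

$$
\sum_{i=1}^{n}(X_{i} - \bar{X})^{2} = \sum_{i=1}^{n}X_{i}^{2} - n\bar{X}^{2},
$$

and as a result

$$
E\Big[\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}\Big] = \sum_{i=1}^{n}E[X_{i}^{2}] - nE[\bar{X}^{2}].
$$

Rearranging the familiar expression for variance yields

$$
E[X_{i}^{2}] = \mathrm{Var}[X_{i}] + (E[X_{i}])^{2} = \sigma^{2} + \mu^{2},
$$

and similarly,

$$
E[\bar{X}^{2}] = \mathrm{Var}[\bar{X}] + (E[\bar{X}])^{2} = \frac{\sigma^{2}}{n} + \mu^{2}.
$$

Therefore

$$
E\Big[\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}\Big] = n(\sigma^{2} + \mu^{2}) - n\Big(\frac{\sigma^{2}}{n} + \mu^{2}\Big) = (n - 1)\sigma^{2}.
$$

Thus,

$$
E[\hat{\sigma}^{2}] = \frac{1}{n}E\Big[\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}\Big] = \frac{n - 1}{n}\sigma^{2},
$$

and we conclude that $\hat{\sigma}^{2}$ is biased since $E[\hat{\sigma}^{2}] \neq \sigma^{2}$. We now consider the sample variance $S^{2} = \frac{1}{n - 1}\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}$:

$$
E[S^{2}] = \frac{1}{n - 1}E\Big[\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}\Big] = \frac{1}{n - 1}(n - 1)\sigma^{2} = \sigma^{2},
$$

and since $E[S^{2}] = \sigma^{2}$, we conclude that $S^{2}$ is an unbiased estimator for $\sigma^{2}$.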
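The size of the bias can also be checked by simulation. The sketch below is illustrative only: the normal population, its variance of 4, the sample size of 5, the number of simulated samples, and the seed are all assumptions made here, not part of the original article. If the divisor-$n$ estimator is off by a factor of $(n - 1)/n$, its average should land near 3.2, while the divisor-$(n - 1)$ estimator should average near 4.

```python
import numpy as np

# Hypothetical settings, chosen only for illustration.
rng = np.random.default_rng(516)
n = 5                # sample size
sigma2 = 4.0         # true variance of the simulated population
n_samples = 200_000  # number of simulated samples

# Each row is one random sample of size n from Normal(0, sigma2).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_samples, n))

# Average each estimator over all simulated samples.
biased_avg = np.var(samples, axis=1, ddof=0).mean()    # divides by n
unbiased_avg = np.var(samples, axis=1, ddof=1).mean()  # divides by n - 1

print(f"divide by n    : {biased_avg:.3f} (theory: {(n - 1) / n * sigma2:.3f})")
print(f"divide by n - 1: {unbiased_avg:.3f} (theory: {sigma2:.3f})")
```

Because the samples are random, the printed averages only approximate the theoretical values; the gap shrinks as `n_samples` grows.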

Demonstration

An important property of an unbiased estimator of a population parameter is that if the sample statistic is evaluated for every possible sample and the average computed, the mean over all samples will exactly equal the population parameter. For a given population with mean $\mu$ and variance $\sigma^{2}$, if the sample variance (division by $n - 1$) is computed for every possible sample of size $n$ drawn with replacement (that is, every ordered $n$-tuple of population values), the average of the sample variances will exactly equal $\sigma^{2}$. This also demonstrates (indirectly) that division by $n$ would consistently underestimate the population variance.

We now attempt to verify this property on the following dataset:

$$
v = [7, 9, 10, 12, 15]
$$

The Python itertools module exposes a collection of efficient iterators that stream values on demand. For example, the permutations function takes as arguments an iterable and the length of the permutation r. It returns all r-length permutations of elements from the iterable, with no element reused within a tuple (itertools also exposes a combinations function that does the same for all r-length combinations, where order is ignored). The product function generates the Cartesian product of the specified iterables, and takes an optional repeat argument. From the documentation:

To compute the product of an iterable with itself, specify the number of repetitions with the optional repeat keyword argument. For example, product(A, repeat=4) means the same as product(A, A, A, A).
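To make the difference between the three functions concrete, the short sketch below counts the 2-element tuples each one yields for the five-value list used in this article's code. Only product allows the same element to be drawn twice, which is what sampling with replacement requires.

```python
import itertools

v = [7, 9, 10, 12, 15]

# Ordered pairs drawn with replacement: 5 * 5 = 25 tuples.
with_replacement = list(itertools.product(v, repeat=2))

# Ordered pairs without replacement: 5 * 4 = 20 tuples.
ordered_no_repeat = list(itertools.permutations(v, 2))

# Unordered pairs without replacement: C(5, 2) = 10 tuples.
unordered_no_repeat = list(itertools.combinations(v, 2))

print(len(with_replacement), len(ordered_no_repeat), len(unordered_no_repeat))
# prints: 25 20 10

# (7, 7) can only occur when sampling with replacement.
print((7, 7) in with_replacement, (7, 7) in ordered_no_repeat)
# prints: True False
```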

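product is used to compute the average sample variance for all 2-, 3-, 4- and 5-element samples drawn with replacement from $v$, and the result is compared to the population variance. Before we begin, let's calculate the population mean and variance:

$$
\mu = \frac{7 + 9 + 10 + 12 + 15}{5} = 10.6
$$

$$
\sigma^{2} = \frac{(7 - 10.6)^{2} + (9 - 10.6)^{2} + (10 - 10.6)^{2} + (12 - 10.6)^{2} + (15 - 10.6)^{2}}{5} = \frac{37.2}{5} = 7.44
$$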
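These population values can be reproduced with NumPy. The sketch relies on np.var defaulting to ddof=0, which divides by the number of values and therefore gives the population variance rather than the sample variance.

```python
import numpy as np

v = [7, 9, 10, 12, 15]

mu = np.mean(v)
sigma2 = np.var(v)  # ddof=0 by default: divide by N, not N - 1

print(f"population mean: {mu:.2f}")
print(f"population variance: {sigma2:.2f}")
# prints:
# population mean: 10.60
# population variance: 7.44
```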

We now compute the average of the sample variance for all $n$-element samples drawn with replacement from $v$ for $2 \leq n \leq 5$:

"""
Demonstrating that the sample variance is an unbiased estimator 
of the population variance. 

Generate all possible 2, 3, 4 and 5-element samples, drawn with
replacement, from [7, 9, 10, 12, 15], and determine the sample
variance of each sample. The average of the sample variances will
exactly equal the population variance if the sample variance is an
unbiased estimator of the population variance.
"""
import itertools
import numpy as np

v = [7, 9, 10, 12, 15]


# Verify that the average of the sample variance
# for all 2-element samples equates to 7.44.
s2 = list(itertools.product(v, repeat=2))
result2 = np.mean([np.var(ii, ddof=1) for ii in s2])

# Verify that the average of the sample variance
# for all 3-element samples equates to 7.44.
s3 = list(itertools.product(v, repeat=3))
result3 = np.mean([np.var(ii, ddof=1) for ii in s3])

# Verify that the average of the sample variance
# for all 4-element samples equates to 7.44.
s4 = list(itertools.product(v, repeat=4))
result4 = np.mean([np.var(ii, ddof=1) for ii in s4])

# Verify that the average of the sample variance
# for all 5-element samples equates to 7.44.
s5 = list(itertools.product(v, repeat=5))
result5 = np.mean([np.var(ii, ddof=1) for ii in s5])

print(f"result2: {result2}")
print(f"result3: {result3}")
print(f"result4: {result4}")
print(f"result5: {result5}")

result2: 7.44
result3: 7.4399999999999995
result4: 7.44
result5: 7.44

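Since the sample variance is an unbiased estimator of the population variance, these results should come as no surprise, but it is an interesting demonstration nonetheless. The trailing digits in result3 (7.4399999999999995) are floating-point rounding, not a departure from 7.44.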
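The same enumeration can be repeated for the two alternatives the article rules out. The sketch below is an addition to the original demonstration, and the names biased_results and no_replacement_avg are chosen here for illustration. The first loop divides by $n$ (ddof=0), so each average should equal $(n - 1)/n \cdot 7.44$, that is 3.72, 4.96, 5.58 and 5.952 for $n = 2, \ldots, 5$. The second part swaps product for permutations, which samples without replacement; the average sample variance then comes out at $N/(N - 1) \cdot 7.44 = 9.3$ rather than 7.44, which is why product is the right tool here.

```python
import itertools
import numpy as np

v = [7, 9, 10, 12, 15]
sigma2 = np.var(v)  # population variance, 7.44

# Division by n (ddof=0), averaged over every ordered sample drawn
# with replacement.
biased_results = {}
for n in range(2, 6):
    samples = np.array(list(itertools.product(v, repeat=n)))
    biased_results[n] = np.var(samples, axis=1, ddof=0).mean()
    expected = (n - 1) / n * sigma2
    print(f"n={n}: average={biased_results[n]:.4f}, expected={expected:.4f}")

# Division by n - 1 (ddof=1), but sampling WITHOUT replacement via
# itertools.permutations, shown here for 2-element samples.
pairs = np.array(list(itertools.permutations(v, 2)))
no_replacement_avg = np.var(pairs, axis=1, ddof=1).mean()
print(f"without replacement, n=2: {no_replacement_avg:.4f}")
```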
