Illustrating The Central Limit Theorem

Posted on January 7, 2021 by Michael Mayer in Data science | 0 Comments

This article was first published on Python – Michael's and Christian's Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Lost in Translation between R and Python 1

This is the first article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

Let’s start with a little bit of statistics – it wont be the last time, friends: Illustrating the Central Limit Theorem (CLT).

Take a sample of a random variable X with finite variance. The CLT says: No matter how “unnormally” distributed X is, its sample mean will be approximately normally distributed, at least if the sample size is not too small. This classic result is the basis to construct simple confidence intervals and hypothesis tests for the (true) mean of X, check out Wikipedia for a lot of additional information.

The code below illustrates this famous statistical result by simulation, using a very asymmetrically distributed X, namely X = 1 with probability 0.2 and X=0 otherwise. X could represent the result of asking a randomly picked person whether he smokes. Conducting such a poll, the mean of the collected sample of such results would be a statistical estimate of the proportion of people smoking.

Curiously, by a tiny modification, the same code will also illustrate another key result in statistics – the Law of Large Numbers: For growing sample size, the distribution of the sample mean of X contracts to the expectation E(X).

Python

# Fix seed, set constants
set.seed(2006)
sample_sizes <- c(1, 10, 30, 1000)
nsims <- 10000

# Helper function: Mean of one sample of X
one_mean <- function(n, p = c(0.8, 0.2)) {
  mean(sample(0:1, n, replace = TRUE, prob = p))
}
# one_mean(10)

# Simulate and plot
par(mfrow = c(2, 2), mai = rep(0.4, 4))

for (n in sample_sizes) {
  means <- replicate(nsims, one_mean(n))
  hist(means, breaks = "FD", 
       # xlim = 0:1, # uncomment for LLN
       main = sprintf("n=%i", n))
}

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Fix seed, set constants
np.random.seed(100)
sample_sizes = [1, 10, 30, 1000]
nsims = 10_000

# Helper function: Mean of one sample
def one_mean(n, p=0.2):
    return np.random.binomial(1, p, n).mean()

# Simulate and plot
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

for i, n in enumerate(sample_sizes):
    means = [one_mean(n) for ell in range(nsims)]
    ax = axes[i // 2, i % 2]
    ax.hist(means, 50)
    ax.title.set_text(f'$n = {n}$')
    ax.set_xlabel('mean')
    # ax.set_xlim(0, 1)  # uncomment for LLN
fig.tight_layout()

Result: The Central Limit Theorem