Want to share your content on python-bloggers? click here.
Imagine you are responsible for maintaining the mean and variance of a dataset that is frequently updated. For small to moderately sized datasets, the method used for recalculation rarely needs much thought. However, with datasets consisting of hundreds of billions or trillions of observations, full recomputation of the mean and variance at each refresh may require computational resources that are not available.
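Fortunately, it isn’t necessary to perform a full recalculation of the mean and variance when accounting for new observations. Recall that for a sequence of $n$ observations $x_1, \dots, x_n$, the mean $\mu_n$ and (population) variance $\sigma_n^2$ are given by:

$$
\mu_n = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_n)^2
$$

A new observation $x_{n+1}$ becomes available. The updated mean $\mu_{n+1}$ and variance $\sigma_{n+1}^2$ can be computed from the previous values without full recalculation:

$$
\mu_{n+1} = \frac{n \mu_n + x_{n+1}}{n + 1}, \qquad \sigma_{n+1}^2 = \frac{n \sigma_n^2}{n + 1} + \frac{(x_{n+1} - \mu_{n+1})^2}{n}
$$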
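For readers who want to see why the variance update holds, here is a short derivation sketch. The notation is an assumption of this sketch: $\mu_n$ and $\sigma_n^2$ denote the mean and population variance of the first $n$ observations, $x_{n+1}$ is the new observation, and $S_n = n \sigma_n^2$ is the sum of squared deviations about the mean.

$$
\begin{aligned}
S_{n+1} &= \sum_{i=1}^{n} (x_i - \mu_{n+1})^2 + (x_{n+1} - \mu_{n+1})^2 \\
        &= S_n + n (\mu_n - \mu_{n+1})^2 + (x_{n+1} - \mu_{n+1})^2 \\
        &= S_n + \frac{(x_{n+1} - \mu_{n+1})^2}{n} + (x_{n+1} - \mu_{n+1})^2 \\
        &= n \sigma_n^2 + \frac{n + 1}{n} (x_{n+1} - \mu_{n+1})^2
\end{aligned}
$$

The second line uses $\sum_{i=1}^{n} (x_i - \mu_n) = 0$, so the cross term vanishes. The third uses $n (\mu_n - \mu_{n+1}) = \mu_{n+1} - x_{n+1}$, which follows from $(n + 1) \mu_{n+1} = n \mu_n + x_{n+1}$. Dividing both sides by $n + 1$ gives $\sigma_{n+1}^2 = \frac{n \sigma_n^2}{n + 1} + \frac{(x_{n+1} - \mu_{n+1})^2}{n}$.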
Demonstration
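Consider the following values:

$$
1154, \; 717, \; 958, \; 1476, \; 889, \; 1414, \; 1364, \; 1047
$$

The mean and variance for the observations:

$$
\mu_8 \approx 1127.38, \qquad \sigma_8^2 \approx 65096.48
$$

A new value, $1251$, becomes available. Full recalculation of the mean and variance yields:

$$
\mu_9 \approx 1141.11, \qquad \sigma_9^2 \approx 59372.99
$$

Calculating the mean and variance using the online update results in:

$$
\mu_9 = \frac{8 (1127.375) + 1251}{9} \approx 1141.11, \qquad \sigma_9^2 \approx \frac{8 (65096.484)}{9} + \frac{(1251 - 1141.111)^2}{8} \approx 59372.99
$$

confirming agreement between the two approaches.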
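Note that the variance returned using the online update formula is the population variance. In order to return the updated unbiased sample variance, we need to multiply the variance returned by the online update formula by $(n + 1) / n$, where $n$ is the number of observations before the new value is added. With $n = 8$ and the new value $1251$ from the demonstration, the updated sample variance is $59372.99 \times 9 / 8 \approx 66794.61$.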
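As a quick check of this correction, the minimal sketch below uses the standard library's `statistics` module. The module is not used in the original post and is chosen here only to keep the example dependency-free. It checks that scaling the population variance of the nine values by $(n + 1) / n$, with $n = 8$, gives the unbiased sample variance.

```python
import math
import statistics

# Values from the demonstration, including the new observation 1251.
values = [1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251]
n = len(values) - 1  # size of the original array, before the new observation

pop_var = statistics.pvariance(values)    # divides by n + 1
sample_var = statistics.variance(values)  # divides by n

# Bessel's correction: sample variance = population variance * (n + 1) / n.
corrected = pop_var * (n + 1) / n
print(math.isclose(corrected, sample_var))
```

The comparison uses `math.isclose` rather than `==` because the two quantities are computed along different floating-point paths.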
Implementation
A straightforward implementation in Python to handle online mean and variance updates, incorporating Bessel’s correction to return the unbiased sample variance, is provided below:
```python
import numpy as np


def online_mean(mean_init, n, new_obs):
    """
    Return updated mean in light of new observation without
    full recalculation.
    """
    return (n * mean_init + new_obs) / (n + 1)


def online_variance(var_init, mean_new, n, new_obs):
    """
    Return updated variance in light of new observation without
    full recalculation. Includes Bessel's correction to return
    unbiased sample variance.

    var_init is the population variance of the original n observations;
    mean_new is the mean after including new_obs.
    """
    return ((n + 1) / n) * (((n * var_init) / (n + 1)) + (((new_obs - mean_new)**2) / n))


a0 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047])
a1 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251])

# Original mean and population variance (ddof=0 is the NumPy default).
mean0 = a0.mean()     # 1127.38
variance0 = a0.var()  # 65096.48

# Full recalculation of mean and unbiased sample variance with new observation.
mean1 = a1.mean()           # 1141.11
variance1 = a1.var(ddof=1)  # 66794.61

# Online update of mean and variance with bias correction.
mean2 = online_mean(mean0, a0.size, 1251)                      # 1141.11
variance2 = online_variance(variance0, mean2, a0.size, 1251)   # 66794.61

print(f"Full recalculation mean : {mean1:,.5f}")
print(f"Full recalculation variance: {variance1:,.5f}")
print(f"Online calculation mean : {mean2:,.5f}")
print(f"Online calculation variance: {variance2:,.5f}")
```
```
Full recalculation mean : 1,141.11111
Full recalculation variance: 66,794.61111
Online calculation mean : 1,141.11111
Online calculation variance: 66,794.61111
```
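The functions above handle a single new observation. When values keep arriving, the same update can be applied repeatedly. The sketch below is one possible way to do that; the `RunningStats` class and its attribute names are hypothetical and not part of the original post. It stores the population variance internally and applies Bessel's correction only when the sample variance is requested.

```python
class RunningStats:
    """Hypothetical helper (not from the original post) that tracks the
    mean and population variance of a stream, one observation at a time."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean
        self.pop_var = 0.0  # running population variance

    def update(self, x):
        """Fold a single new observation into the running statistics."""
        if self.n == 0:
            # First observation: the mean is x and the variance is zero.
            self.n, self.mean, self.pop_var = 1, float(x), 0.0
            return
        n = self.n
        new_mean = (n * self.mean + x) / (n + 1)
        # Same population-variance update used inside online_variance,
        # before Bessel's correction is applied.
        self.pop_var = (n * self.pop_var) / (n + 1) + (x - new_mean) ** 2 / n
        self.mean = new_mean
        self.n = n + 1

    @property
    def sample_var(self):
        """Unbiased sample variance; needs at least two observations."""
        if self.n < 2:
            raise ValueError("sample variance needs at least two observations")
        return self.pop_var * self.n / (self.n - 1)


stats = RunningStats()
for value in [1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251]:
    stats.update(value)

print(f"Running mean           : {stats.mean:,.5f}")
print(f"Running sample variance: {stats.sample_var:,.5f}")
```

Feeding in the nine values from the demonstration should reproduce the mean and sample variance reported above, up to floating-point rounding.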