A Beautiful Regression Formula

[This article was first published on Python – Michael's and Christian's Blog, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Lost in Translation between R and Python 4

Hello statistics aficionados

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

The last one was a deep dive into historic mortality rates.

No Covid-19, no public data for a change: This post focusses on a real beauty, namely a decomposition of the R-squared in a linear regression model

E(y) = \alpha + \sum_{j = 1}^p x_j \beta_j

fitted by least-squares. If the response y and all p covariables are standardized to variance 1 beforehand, then the R-squared can be obtained as the cross-product of the fitted coefficients and the usual correlations between each covariable and the response:

R^2 = \sum_{j = 1}^p \text{cor}(y, x_j)\hat\beta_j.

Two elegant derivations can be found in this answer to the same question, written by the number 1 contributor to crossvalidated: whuber. Look up a couple of his posts – and statistics will suddenly feel super easy and clear.

Direct consequences of the formula are:

  1. If a covariable is uncorrelated with the response, it cannot contribute to the R-squared, i.e. neither improve nor worsen. This is not obvious.
  2. A correlated covariable only improves R-squared if its coefficient is non-zero. Put differently: if the effect of a covariable is already fully covered by the other covariables, it does not improve the R-squared. This is somewhat obvious.

Note that all formulas refer to in-sample calculations.

Since we do not want to bore you with math, we simply demonstrate the result with short R and Python codes based on the famous iris dataset.

y <- "Sepal.Width"
x <- c("Sepal.Length", "Petal.Length", "Petal.Width")

# Scaled version of iris
iris2 <- data.frame(scale(iris[c(y, x)]))

# Fit model 
fit <- lm(reformulate(x, y), data = iris2)
summary(fit) # multiple R-squared: 0.524
(betas <- coef(fit)[x])
# Sepal.Length Petal.Length  Petal.Width 
#    1.1533143   -2.3734841    0.9758767 

# Correlations (scaling does not matter here)
(cors <- cor(iris[, y], iris[x]))
# Sepal.Length Petal.Length Petal.Width
#   -0.1175698   -0.4284401  -0.3661259
 
# The R-squared?
sum(betas * cors) # 0.524

# Import packages
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Load data
iris = datasets.load_iris(as_frame=True).data
print("The data:", iris.head(3), sep = "\n")

# Specify response
yvar = "sepal width (cm)"

# Correlations of everyone with response
cors = iris.corrwith(iris[yvar]).drop(yvar)
print("\nCorrelations:", cors, sep = "\n")

# Prepare scaled response and covariables
X = StandardScaler().fit_transform(iris.drop(yvar, axis=1))
y = StandardScaler().fit_transform(iris[[yvar]])

# Fit linear regression
OLS = LinearRegression().fit(X, y)
betas = OLS.coef_[0]
print("\nScaled coefs:", betas, sep = "\n")

# R-squared via scikit-learn: 0.524
print(f"\nUsual R-squared:\t {OLS.score(X, y): .3f}")

# R-squared via decomposition: 0.524
rsquared = betas @ cors.values
print(f"Applying the formula:\t {rsquared: .3f}")

Indeed: the cross-product of coefficients and correlations equals the R-squared of 52%.

The Python notebook and R code can be found at:

To leave a comment for the author, please follow the link and comment on their blog: Python – Michael's and Christian's Blog.

Want to share your content on python-bloggers? click here.