This post is intended to shed light on why the closed-form solution for the linear regression coefficient estimates is avoided in statistical software packages. But first, we derive that solution from the normal equations within the standard multivariate regression setting.
The normal linear regression model specifies that the sampling variability around the mean is i.i.d. from a normal distribution:

$$y_i = \beta^{T}x_i + \epsilon_i, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$

Therefore the specification for the joint PDF of the observed data, conditional upon the $\beta$ and $\sigma^2$ values, is given by:

$$f(y \mid X, \beta, \sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - x_i^{T}\beta)^2}{2\sigma^2}\right).$$

Alternatively, the joint PDF can be represented in terms of the multivariate normal distribution. Let

$$y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n),$$

where $y$ is the $n \times 1$ vector of responses, $X$ the $n \times p$ design matrix, $\beta$ the $p \times 1$ vector of coefficients and $I_n$ the $n \times n$ identity matrix. Then

$$f(y \mid X, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)^{T}(y - X\beta)\right).$$

The density depends on $\beta$ only through the quadratic form $(y - X\beta)^{T}(y - X\beta)$, so maximizing the likelihood over $\beta$ is equivalent to minimizing this residual sum of squares. Computing the first derivative of the last expression above w.r.t. $\beta$, setting it equal to zero and solving yields the normal equations and their closed-form solution:

$$\frac{\partial}{\partial\beta}(y - X\beta)^{T}(y - X\beta) = -2X^{T}(y - X\beta) = 0 \quad\Longrightarrow\quad X^{T}X\hat{\beta} = X^{T}y \quad\Longrightarrow\quad \hat{\beta} = (X^{T}X)^{-1}X^{T}y.$$
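As a quick sanity check (the data below is randomly generated and purely illustrative), the closed-form estimate can be compared against NumPy's built-in least-squares solver:

```python
# Comparing the closed-form normal equations estimate with np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(size=100)

# Closed-form solution from the normal equations.
B_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Library least-squares solver.
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(B_closed, B_lstsq))
```

For a well-conditioned $X$ like this one the two agree to within floating point tolerance; the interesting question is what happens when $X$ is not well-conditioned.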
Why isn’t this expression implemented in linear model solvers directly?
The condition number of a matrix is the ratio of maximum-to-minimum singular values (which, for a normal matrix, is the ratio of the maximum-to-minimum absolute value of eigenvalues). Essentially, the condition number tells you how much solving a linear system will magnify any noise in your data. It can be thought of as a measure of amplification. The smaller the condition number, the better (the best value being 1).
""" Demonstrating the equivalence of computing the condition number as the ratio of maximum-to-minimum singular values and using np.linalg.cond, as well as a comparison of the condition numbers for X vs. X^T*X. """ import numpy as np rng = np.random.default_rng(516) X = rng.normal(size=(50, 10)) # SVD for X. U0, S0, Vt0 = np.linalg.svd(X, full_matrices=True) c0 = np.linalg.cond(X, p=None) # SVD for X^T*X. U1, S1, Vt1 = np.linalg.svd(X.T @ X, full_matrices=True) c1 = np.linalg.cond(X.T @ X, p=None) # S0 and S1 represent the singular values of X and X^T*X. print(f"S0.max() / S0.min() : {S0.max() / S0.min():.8f}.") print(f"Condition number of X : {c0:.8f}.") print(f"S1.max() / S1.min() : {S1.max() / S1.min():.8f}.") print(f"Condition number of X^T*X : {c1:.8f}.")
```
S0.max() / S0.min() : 2.44498390.
Condition number of X : 2.44498390.
S1.max() / S1.min() : 5.97794628.
Condition number of X^T*X : 5.97794628.
```
In terms of numerical precision, computing $X^{T}X$ squares the condition number of $X$: $\kappa(X^{T}X) = \kappa(X)^2$, as the example above illustrates ($2.44498390^2 \approx 5.97794628$). Roughly speaking, this doubles the number of digits of precision lost when solving the linear system.

If the condition number of $X$ is already large, i.e. $X$ is ill-conditioned, forming and inverting the Gram matrix $X^{T}X$ can leave very few accurate digits in the estimated coefficients. This is why linear model solvers avoid the closed-form expression and instead work with $X$ directly, most commonly via the QR Decomposition or the Singular Value Decomposition.
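To make this concrete, here is a small contrived sketch: the singular values of $X$ are set by hand so that $X$ is poorly conditioned, and a normal-equations solve is compared against a QR-based solve.

```python
# Illustration: precision loss from forming X^T*X when X is ill-conditioned.
# The singular values below are chosen by hand to make cond(X) large.
import numpy as np

rng = np.random.default_rng(516)
m, n = 100, 6

# Construct X = U * S * V^T with a wide spread of singular values.
U, _ = np.linalg.qr(rng.normal(size=(m, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
S = np.diag([1e0, 1e-2, 1e-4, 1e-5, 1e-6, 1e-7])
X = U @ S @ V.T

beta_true = np.ones(n)
y = X @ beta_true

# Normal equations solve vs. QR-based solve.
B_normal = np.linalg.solve(X.T @ X, X.T @ y)
Q, R = np.linalg.qr(X, mode="reduced")
B_qr = np.linalg.solve(R, Q.T @ y)

print(f"cond(X)              : {np.linalg.cond(X):.3e}")
print(f"cond(X^T*X)          : {np.linalg.cond(X.T @ X):.3e}")
print(f"normal equations err : {np.linalg.norm(B_normal - beta_true):.3e}")
print(f"QR err               : {np.linalg.norm(B_qr - beta_true):.3e}")
```

On a typical run, the QR-based solution recovers the true coefficients to several more digits than the normal-equations solution, reflecting the gap between $\kappa(X)$ and $\kappa(X)^2$.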
The QR Decomposition
The QR Decomposition factors a matrix $X \in \mathbb{R}^{m \times n}$ (with $m \geq n$) into the product $X = QR$, where $Q$ is an $m \times n$ matrix with orthonormal columns ($Q^{T}Q = I$) and $R$ is an $n \times n$ upper-triangular matrix. Substituting $QR$ for $X$ in the normal equations gives

$$X^{T}X\beta = X^{T}y \quad\Longrightarrow\quad R^{T}Q^{T}QR\beta = R^{T}Q^{T}y,$$

where we've taken advantage of how transpose distributes over matrix products (i.e. $(QR)^{T} = R^{T}Q^{T}$).

Because $Q^{T}Q = I$ and $R^{T}$ is invertible when $X$ has full column rank, the system reduces to

$$R\beta = Q^{T}y,$$

an upper-triangular system that can be solved efficiently by back substitution, without ever forming $X^{T}X$ or computing an explicit matrix inverse.
```python
# Solving for regression coefficients using normal equations and QR decomposition.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(516)
X = rng.normal(size=(50, 5))
y = rng.normal(scale=5, size=50)

# Normal equations solution.
B0 = np.linalg.inv(X.T @ X) @ X.T @ y

# QR Decomposition solution.
Q, R = np.linalg.qr(X, mode="reduced")
B1 = solve_triangular(R, Q.T @ y)

print(f"B0: {B0}")
print(f"B1: {B1}")
print(f"np.allclose(B0, B1): {np.allclose(B0, B1)}")
```
```
B0: [ 0.42402765 -1.21951527  0.22396056  0.26773935 -0.72067314]
B1: [ 0.42402765 -1.21951527  0.22396056  0.26773935 -0.72067314]
np.allclose(B0, B1): True
```
Using the QR Decomposition, we've eliminated the need to explicitly form the Gram matrix, and we no longer need to invert any matrix, an operation that is computationally expensive and introduces its own precision issues.
The Singular Value Decomposition
The Singular Value Decomposition is a generalization of the Eigendecomposition to any $m \times n$ rectangular matrix. For $X \in \mathbb{R}^{m \times n}$, the SVD factors $X$ into the product $X = U\Sigma V^{T}$, where:

- $U$ is an $m \times m$ orthogonal matrix (assuming $X$ is real); its columns represent the left singular vectors of $X$.
- $\Sigma$ is an $m \times n$ diagonal matrix with diagonal entries representing the singular values of $X$.
- $V^{T}$ is an $n \times n$ orthogonal matrix (assuming $X$ is real); its rows represent the right singular vectors of $X$.
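As a quick check of these shapes and orthogonality properties, on an arbitrary random matrix:

```python
# Verifying the shapes and orthogonality properties of the full SVD.
import numpy as np

rng = np.random.default_rng(516)
X = rng.normal(size=(7, 4))

U, S, Vt = np.linalg.svd(X, full_matrices=True)

print(U.shape, S.shape, Vt.shape)          # (7, 7) (4,) (4, 4)
print(np.allclose(U.T @ U, np.eye(7)))     # True: U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(4)))   # True: V^T is orthogonal

# np.linalg.svd returns the singular values as a 1-D array, so rebuild the
# m x n diagonal matrix Sigma to reconstruct X.
Sigma = np.zeros((7, 4))
np.fill_diagonal(Sigma, S)
print(np.allclose(U @ Sigma @ Vt, X))      # True
```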
Beginning with the normal equations solution, replace $X$ with $U\Sigma V^{T}$ (using the reduced SVD, so that $\Sigma$ is a square, invertible $n \times n$ matrix):

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y = (V\Sigma U^{T}U\Sigma V^{T})^{-1}V\Sigma U^{T}y.$$

Since $U^{T}U = I$ and $V$ is orthogonal, this simplifies to

$$\hat{\beta} = (V\Sigma^{2}V^{T})^{-1}V\Sigma U^{T}y = V\Sigma^{-2}V^{T}V\Sigma U^{T}y = V\Sigma^{-1}U^{T}y,$$

which again avoids forming $X^{T}X$ and requires only the trivial inversion of the diagonal matrix $\Sigma$.
Determining the estimated coefficients using SVD can be accomplished as follows:
```python
# Solving for regression coefficients using normal equations and SVD.
import numpy as np

rng = np.random.default_rng(516)
X = rng.normal(size=(50, 5))
y = rng.normal(scale=5, size=50)

# Normal equations solution.
B0 = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD solution.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
B1 = Vt.T @ np.diag(1 / S) @ U.T @ y

print(f"B0: {B0}")
print(f"B1: {B1}")
print(f"np.allclose(B0, B1): {np.allclose(B0, B1)}")
```
```
B0: [ 0.42402765 -1.21951527  0.22396056  0.26773935 -0.72067314]
B1: [ 0.42402765 -1.21951527  0.22396056  0.26773935 -0.72067314]
np.allclose(B0, B1): True
```
Each of these methods (normal equations, QR Decomposition, SVD) incurs a different computational cost. From Golub & Van Loan, the flop counts associated with each algorithm are presented below:
| Method                | Flop count       |
|-----------------------|------------------|
| Normal Equations      | mn^2 + n^3/3     |
| Householder QR        | 2mn^2 - 2n^3/3   |
| Modified Gram-Schmidt | 2mn^2            |
| SVD                   | 2mn^2 + 11n^3    |
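These are operation counts, not wall-clock guarantees, so as a rough empirical check (the matrix sizes and repetition count below are arbitrary, and results will vary with the underlying BLAS/LAPACK build) the three approaches can be timed directly:

```python
# Rough timing comparison of the three solution methods. Wall-clock time
# reflects the BLAS/LAPACK implementation as much as the flop counts, so
# treat this as a coarse sanity check rather than a benchmark.
import timeit

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(516)
m, n = 5_000, 50
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

def normal_equations():
    return np.linalg.solve(X.T @ X, X.T @ y)

def qr_solve():
    Q, R = np.linalg.qr(X, mode="reduced")
    return solve_triangular(R, Q.T @ y)

def svd_solve():
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((U.T @ y) / S)

for fn in (normal_equations, qr_solve, svd_solve):
    elapsed = timeit.timeit(fn, number=50)
    print(f"{fn.__name__:<16}: {elapsed:.4f}s for 50 runs")
```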
Householder reflections and Modified Gram-Schmidt are two different algorithms for carrying out the first step, computing the QR factorization itself. The SVD offers even greater numerical stability, but comes with added runtime cost. Other matrix decompositions such as LU and Cholesky can also be leveraged (a quick Cholesky sketch follows below), but the QR Decomposition represents a good trade-off between runtime and numerical stability, which explains its wide use in statistical software packages. Check out A Deep Dive into how R Fits a Linear Model for an in-depth explanation of how the QR Decomposition is used to fit linear models in R.
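For reference, a Cholesky-based solve of the normal equations might look like the sketch below; note that it still forms the Gram matrix explicitly, so it inherits the conditioning issues discussed earlier:

```python
# Solving the normal equations via a Cholesky factorization of X^T*X.
# Fast, but X^T*X is still formed explicitly, so the condition number is squared.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(516)
X = rng.normal(size=(50, 5))
y = rng.normal(scale=5, size=50)

# Factor the (symmetric positive definite) Gram matrix, then solve.
c, low = cho_factor(X.T @ X)
B = cho_solve((c, low), X.T @ y)

print(B)
```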