# Estimating Logistic Regression coefficients From Scratch in Python

*This article was first published on*

Want to share your content on python-bloggers? click here.

**The Pleasure of Finding Things Out: A blog by James Triveri**, and kindly contributed to python-bloggers. (You can report issue about the content on this page here)Want to share your content on python-bloggers? click here.

In this post, I’ll demonstrate how to estimate the coefficients of a logistic regression model using the Fisher Scoring algorithm in Python. These estimates will be compared with statsmodels coefficients to ensure consistency.

In a generalized linear model (GLM), the response may have any distribution from the exponential family. Rather than assuming the mean is a linear function of the explanatory variables, we assume that a function of the mean, or the link function, is a linear function of the explanatory variables.

Logistic regression is used for modeling data with a categorical response. Although it’s possible to model multinomial data using logistic regression, in this post our analysis will be limited to models targeting a dichotomous response, where the outcome can be classified as ‘Yes/No’ or ‘1/0’.

The logistic regression model is a GLM whose canonical link is the logit, or log-odds:

for .

Solving the logit for , which is a stand-in for the predicted probability associated with observation , yields

where and .

### Parameter Estimation

Maximum Likelihood Estimation can be used to determine the parameters of a Logistic Regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the unknown .

Let represent independent, dichotomous response values for each of observations, where denotes a success and denotes a failure. The density function of a single observation is given by

and the corresponding likelihood function is

Taking the natural log of the maximum likelihood estimate results in the log-likelihood function:

The first-order partial derivatives of the log-likelihood are calculated and set to zero for each

which can be represented in matrix notation as

where is a (p + 1)-by-n matrix and an n-by-1 vector.

The vector of first-order partial derivatives of the log-likelihood function is referred to as the score function and is typically represented as .

These (p+1) equations are solved simultaneously to obtain the parameter estimates .

Each solution specifies a critical-point which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite (which means every element on the diagonal of the matrix is less than zero).

The matrix of second partial derivatives is given by

represented in matrix form as

where is an n-by-n diagonal matrix of weights with each element equal to for logistic regression models (in general, the weights matrix will have entries inversely proportional to the variance of the response).

Since no closed-form solution exists for determining logistic regression model coefficients, iterative techniques must be employed.

### Fitting the Model

Two distinct but related iterative methods can be utilized in determining model coefficients: the Newton-Raphson method and Fisher Scoring. The Newton-Raphson method relies on the matrix of second partial derivatives, also known as the Hessian. The Newton-Raphson update expression is:

where:

- = the vector of updated coefficient estimates.
- = the vector of coefficient estimates from the previous iteration.
- = the inverse of the Hessian, .
- = the vector of first-order partial derivatives of the log-likelihood function, .

The Newton-Raphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial, and then finding the location of that polynomial’s maximum value. This process continues until it converges to the actual solution. The convergence of to is usually fast, with adequate convergence frequently realized after fewer than 50 iterations.

An alternative method, *Fisher Scoring*, utilizes the expected information . Let serve as a stand-in for the expected value of the information:

The Fisher scoring update step replaces from Newton-Raphson with :

where:

- = the vector of updated coefficient estimates.
- = the vector of coefficient estimates from the previous iteration.
- = the inverse of the expected information matrix, .
- = the vector of first-order partial derivatives of the log-likelihood function, .

For GLMs with a canonical link (of which employing the logit for logistic regression is an example), the observed and expected information are the same. When the response follows an exponential family distribution, and the canonical link function is employed, observed and expected information coincide so that Fisher scoring and Newton-Raphson are identical.

When the canonical link is used, the second partial derivatives of the log-likelihood do not depend on the observation , and therefore

Fisher scoring has the advantage that it produces the asymptotic covariance matrix as a by-product.

To summarize:

- The
*Hessian*is the matrix of second partial derivatives of the log-likelihood with respect to the parameters, . - The
*observed information*is . - The
*expected information*is . - The
*asymptotic covariance matrix*is .

For models employing a canonical link function:

- The observed and expected information are the same, .
- , or .
- The Newton-Raphson and Fisher Scoring algorithms yield identical results.

### Fisher Scoring Implementation

The data used for our sample calculation can be obtained here. The data represents O-Ring failures in the 23 pre-Challenger space shuttle missions. TEMPERATURE will serve as the single explanatory variable which will be used to predict O_RING_FAILURE, which is 1 if a failure occurred, 0 otherwise.

Once the parameters have been determined, the model estimate of the probability of success for a given observation can be calculated with:

In the following code, we define a single function, `get_params`

, which returns the estimated model coefficients as a (p+1)-by-1 array. In addition, the function returns the number of scoring iterations, fitted values and the variance-covariance matrix for the estimated parameters.

import numpy as np def estimate_lr_params(X, y, epsilon=.001): """ Estimate logistic regression coefficients using Fisher Scoring.Iteration ceases once changes between elements in coefficient matrix across consecutive iterations is less than epsilon. - design_matrix `X` : n-by-(p+1) - response_vector `y` : n-by-1 - probability_vector `p` : n-by-1 - weights_matrix `W` : n-by-n - epsilon : threshold above which iteration continues - n : Number of observations - (p + 1) : Number of parameters (+1 for intercept term) - U: First derivative of log-likelihood with respect to each beta_i, i.e. "Score Function" = X^T * (y - p) - I: Second derivative of log-likelihood with respect to each beta_i, i.e. "Information Matrix" = (X^T * W * X) - X^T*W*X results in a (p + 1)-by-(p + 1) matrix. - X^T(y - p) results in a (p+1)-by-1 matrix. - (X^T*W*X)^-1 * X^T(y - p) results in a (p + 1)-by-1 matrix. Returns ------- dict of model results """ def sigmoid(v): return (1 / (1 + np.exp(-v))) betas0 = np.zeros(X.shape[1]).reshape(-1, 1) p = sigmoid(X @ betas0) W = np.diag((p * (1 - p)).ravel()) I = X.T @ W @ X U = X.T @ (y - p) n_iter = 0 while True: n_iter+=1 betas = betas0 + np.linalg.inv(I) @ U betas = betas.reshape(-1, 1) if np.all(np.abs(betas - betas0) < epsilon): break else: p = sigmoid(X @ betas) W = np.diag((p * (1 - p)).ravel()) I = X.T @ W @ X U = X.T @ (y - p) betas0 = betas dresults = { "params": betas.ravel(), "ypred": sigmoid(X @ betas), "V": np.linalg.inv(I), "n_iter": n_iter } return(dresults)

We read in the Challenger dataset, partition the data into the design matrix and response vector, which are then passed to `estimate_lr_params`

:

import pandas as pd df = pd.read_csv("https://gist.githubusercontent.com/jtrive84/835514a76f7afd552c999e4d9134baa8/raw/6dac51b80f892ef051174a46766eb53c7b609ebd/Challenger.csv") X0 = df[["TEMPERATURE"]].values X = np.concatenate([np.ones(X0.shape[0]).reshape(-1, 1), X0], axis=1) y = df[["O_RING_FAILURE"]].values dresults = estimate_lr_params(X, y) dresults

{'params': array([15.04290163, -0.23216274]), 'ypred': array([[0.43049313], [0.22996826], [0.27362106], [0.32209405], [0.37472428], [0.1580491 ], [0.12954602], [0.22996826], [0.85931657], [0.60268105], [0.22996826], [0.04454055], [0.37472428], [0.93924781], [0.37472428], [0.08554356], [0.22996826], [0.02270329], [0.06904407], [0.03564141], [0.08554356], [0.06904407], [0.82884484]]), 'V': array([[ 5.44406534e+01, -7.96333573e-01], [-7.96333573e-01, 1.17143602e-02]]), 'n_iter': 5}

`estimate_lr_params`

returns a dictionary consisting of the following keys:

`"params"`

: Estimated parameters.`"ypred"`

: Fitted values.`"V"`

: Variance-covariance matrix of the parameter estimates.`"n_iter"`

: Number of Fisher scoring iterations.

For the Challenger dataset, our implementation of Fisher scoring results in a model with and . In order to predict new probabilities of O-Ring Failure based on temperature, we use:

Negative coefficients correspond to features that are negatively associated with the probability of a positive outcome, with the reverse being true for positive coefficients.

Lets compare the results of our implementation against the estimates produced by statsmodels:

import statsmodels.formula.api as smf mdl = smf.logit("O_RING_FAILURE ~ TEMPERATURE", data=df).fit() print(f"\nmdl.params:\n{mdl.params}\n") print(f"mdl.cov_params():\n{mdl.cov_params()}\n") print(f"mdl.predict(df):\n{mdl.predict(df)}\n")

Optimization terminated successfully. Current function value: 0.441635 Iterations 7 mdl.params: Intercept 15.042902 TEMPERATURE -0.232163 dtype: float64 mdl.cov_params(): Intercept TEMPERATURE Intercept 54.444275 -0.796387 TEMPERATURE -0.796387 0.011715 mdl.predict(df): 0 0.430493 1 0.229968 2 0.273621 3 0.322094 4 0.374724 5 0.158049 6 0.129546 7 0.229968 8 0.859317 9 0.602681 10 0.229968 11 0.044541 12 0.374724 13 0.939248 14 0.374724 15 0.085544 16 0.229968 17 0.022703 18 0.069044 19 0.035641 20 0.085544 21 0.069044 22 0.828845 dtype: float64

The values produced using the statsmodels align closely with the results from `estimate_lr_params`

.

A feature of logistic regression models is that the predictions preserve the data’s marginal probabilities. If you aggregate the fitted values from the model, the total will equal the number of positive outcomes in the original target vector:

# estimate_lr_params. dresults["ypred"].sum()

7.000000000274647

# statsmodels. mdl.predict(df).sum()

7.0000000000000036

We have 7 positive instances in our dataset, and the total probability aggregates to 7 in both sets of predictions.

**leave a comment**for the author, please follow the link and comment on their blog:

**The Pleasure of Finding Things Out: A blog by James Triveri**.

Want to share your content on python-bloggers? click here.