# Gradient Descent for Logistic Regression

*This article was first published on*

Want to share your content on python-bloggers? click here.

**The Pleasure of Finding Things Out: A blog by James Triveri**, and kindly contributed to python-bloggers. (You can report issue about the content on this page here)Want to share your content on python-bloggers? click here.

Within the GLM framework, model coefficients are estimated using iterative reweighted least squares (IRLS), sometimes referred to as Fisher Scoring. This works well, but becomes inefficient as the size of the dataset increases: IRLS relies on the matrix of second partial derivatives which is expensive to compute. Instead, we can estimate logistic regression coefficients using gradient descent, which only relies on the first derivative of the cost function. This is much more efficient to compute, and generally provides good estimates once features have been standardized.

Expression for linear predictions:

Expression for predicted probabilities:

Gradient descent:

- is the learning rate.
- represents the gradients of the loss function.

Optimal parameters:

The loss function is used to determine how much predictions differs from labels. Here we use binary cross entropy loss:

which we obtain from the log-likelihood of a single observation assumed to follow a binomial distribution. We flip the sign (the leading `-`

) in order to minimize the loss. Note that .

Expression for gradient of loss function:

We choose that maximizes the log-probability of the true y labels in the training data given the observations . The goal is to find the set of weights which minimize the loss function, averaged over all examples. For each variable in , the gradient will have a component that tells us the slope w.r.t. that variable.

The breast cancer dataset is loaded from scikit-learn, the features are standardized and model coefficients estimated using gradient descent. Results are then compared with statsmodels coefficient estimates using the same data to ensure correctness.

""" Load breast cancer dataset from scikit-learn. Standardize features. """ import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.datasets import load_breast_cancer from sklearn.metrics import accuracy_score from sklearn.preprocessing import StandardScaler data_loader = load_breast_cancer() X_data, y_data = data_loader.data, data_loader.target column_names = data_loader.feature_names df = pd.DataFrame(X_data, columns=column_names) keep_features = [ 'mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness' ] X = df[keep_features].values y = y_data[:,None] # Standardize features. sclr = StandardScaler() X = sclr.fit_transform(X) # Add bias term. bias = np.ones(X.shape[0])[:, None] X = np.concatenate([bias, X], axis=1)

`eta`

represents the learning rate and is a hyperparameter. `num_epochs`

can be set as desired. For this example, we havenâ€™t incorporated early stopping logic, but this would be straightforward to implement. We initialize the weight vector `w`

to all 0s. `w`

has the same length as the number of features + 1 for the intercept term.

eta = .50 num_epochs = 50000 w = np.zeros((X.shape[1], 1)) loss = [] for _ in range(num_epochs): z = X @ w ypred = 1. / (1. + np.exp(-z)) L = -np.mean(y * np.log(ypred) + (1 - y) * np.log(1 - ypred)) dL = np.mean(((ypred - y) * X), axis=0)[:, None] loss.append(L) w-=eta * dL print(f"w: {w.ravel()}")

w: [ -0.4592425 22.09479813 -1.56463209 -14.7402888 -14.68918055 -1.66460167]

Estimating the coefficients using statsmodels yields:

import statsmodels.formula.api as smf df2 = pd.DataFrame(X[:,1:], columns=keep_features) df2["target"] = y df2.columns = [ii.replace(" ", "_") for ii in df2.columns] features = " + ".join([ii for ii in df2.columns if ii!="target"]) mdl_expr = "target ~ " + features mdl = smf.logit(mdl_expr, data=df2).fit() mdl.params

Optimization terminated successfully. Current function value: 0.148702 Iterations 10

Intercept -0.459246 mean_radius 22.094852 mean_texture -1.564632 mean_perimeter -14.740317 mean_area -14.689213 mean_smoothness -1.664601 dtype: float64

Which matches closely with the results produced via gradient descent.

**leave a comment**for the author, please follow the link and comment on their blog:

**The Pleasure of Finding Things Out: A blog by James Triveri**.

Want to share your content on python-bloggers? click here.