Logistic regression is the simplest classification algorithm you’ll ever encounter. It’s similar to the linear regression explored last week, but with a twist. More on that in a bit.
Today you’ll get your hands dirty by implementing and tweaking the logistic regression algorithm from scratch. This is the third of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more. The links to the previous articles are located at the end of this piece.
The article is structured as follows:
- Introduction to Logistic Regression
- Math Behind Logistic Regression
- Introduction to Binary Cross Entropy Loss
- From-Scratch Implementation
- Threshold Optimization
- Comparison with Scikit-Learn
You can download the corresponding notebook here.
Introduction to Logistic Regression
Logistic regression is a fundamental machine learning algorithm for binary classification problems. Nowadays, it’s commonly used only for constructing a baseline model. Still, it’s an excellent first algorithm to build because it’s highly interpretable.
In a way, logistic regression is similar to linear regression. We’re still dealing with a line equation for making predictions. This time, the results are passed through a Sigmoid activation function to convert real values to probabilities.
The probability tells you the chance of the instance belonging to a positive class (e.g., this customer has a 0.85 churn probability). These probabilities are then turned to actual classes based on a threshold value. If the probability is greater than the threshold, we assign the positive class and vice-versa.
The threshold value can (and should) be altered depending on the problem and the type of metric you’re optimizing for.
Let’s talk about assumptions of a logistic regression model:
- The observations (data points) are independent
- There is little to no multicollinearity among independent variables (check for correlation and remove for redundancy)
- Large sample size – a minimum of 10 cases with the least frequent outcome for each independent variable. For example, if you have five independent variables and the expected probability of the least frequency outcome is 0.1, then you need a minimum sample size of 500 (10 * 5 / 0.1)
Training a logistic regression model means calculating the best coefficients for weights and bias. These can be calculated through an iterative optimization process known as gradient descent. More on that in the next section.
Math Behind Logistic Regression
The math behind logistic regression is quite simple. We’re still dealing with a line equation:
But this time, the output of the line equation is passed through a Sigmoid (Logistic) function, shown in the following formula:
The role of a sigmoid function is to take any real value and map it to a probability – value between zero and one. It’s an S-shaped function, and you can use the following code to visualize it:
Here’s the visualization:
The value the sigmoid function returns is interpreted as a probability of the positive class. If the probability is larger than some threshold (commonly 0.5), we assign the positive class. If the probability is lower than the threshold, we assign the negative class.
As with linear regression, there are two parameters we need to optimize for – weights and bias. We’ll need to declare the cost function to perform the optimization. Unfortunately, the familiar mean squared error function can’t be used. Well, it can be used in theory, but it isn’t a good idea.
Instead, we’ll use a Binary Cross Entropy function, shown in the following formula:
Don’t worry if it looks like a foreign language, we’ll explain it in the next section.
Next, you’ll need to use this cost function in the optimization process to update weights and bias iteratively. To do so, you’ll have to calculate partial derivatives of the binary cross entropy function concerning weights and bias parameters:
The scalar can be omitted, as it doesn’t make any difference. Next, you’ll have to update the existing weights and bias according to the update rules – shown in the following formulas:
The alpha parameter represents the learning rate. The entire process is repeated for the desired number of iterations.
And that’s all with regards to the math! Let’s go over the binary cross entropy loss function next.
Introduction to Binary Cross Entropy Loss
Binary cross entropy is a common cost (or loss) function for evaluating binary classification models. It’s commonly referred to as log loss, so keep in mind these are synonyms.
This cost function “punishes” wrong predictions much more than it “rewards” good ones. Let’s see it in action.
Example 1 – Calculating BCE for a correct prediction
Let’s say your model predicts the positive class with a 90% probability (0.9). This means the model is only 10% confident the negative class should be predicted.
Question: What’s the BCE value?
As you can see, the BCE value is rather small, only 0.1. This is because the model was pretty confident in the prediction. Let’s see what happens if that’s not the case.
Example 2 – Calculating BCE for an incorrect prediction
Let’s say your model predicts the positive class with a 10% probability (0.1). This means the model is 90% confident the negative class should be predicted.
Question: What is the BCE value?
As you can see, the loss is quite big in this case – a perfect demonstration of how BCE punishes the wrong prediction much more than it rewards the good ones.
I’m not a big fan of doing math by hand. If the same applies to you, you’ll like this part. The following function implements BCE from scratch in Python:
We need the
safe_log() function because
log(0) equals infinity. Anyhow, you’ll see that our by-hand calculations were correct if you run this code.
You now know everything needed to implement a logistic regression algorithm from scratch. Let’s do that next.
Let the fun part begin! We’ll now declare a class called
LogisticRegression with the following methods:
__init__(learning_rate, n_iterations)– the constructor, contains the values for learning rate and the number of iterations, alongside the weights and bias (initially set to None)
_sigmoid(x)– logistic activation function, you know the formula
_binary_cross_entropy(y, y_hat)– our cost function – we implemented it earlier already
fit(X, y)– iteratively optimizes weights and bias through gradient descent. After the calculation is done, the results are stored in the constructor
predict_proba(X)– calculates the prediction probabilities using the line equations passed through a sigmoid activation function
predict(X, threshold)– calculates predicted classes (binary) based on the threshold parameter
If you understand the math behind logistic regression, implementation in Python should be an issue. It all boils down to around 70 lines of documented code:
Let’s test the algorithm next. We’ll use the Breast cancer dataset from Scikit-Learn. The following code snippet loads it, makes a train/test split in 80:20 ratio, instantiates the model, fits the data, and makes predictions:
In case you want to know, here are the values for the optimal weights (accessed through
And here’s the optimal bias (accessed through
This concludes the training portion. Let’s evaluate the model next.
We’ll keep things simple here and print only the accuracy score and the confusion matrix. You can use the following code snippet to do so:
Here’s the accuracy value:
And here’s the confusion matrix:
As you can see, the model works just fine with around 95% accuracy. There are six false negatives, meaning that in six cases model predicted “No” when the actual condition was “Yes”. Still, more than decent results.
Let’s explore how you can make the results even better by tweaking the classification threshold.
There’s no guarantee that 0.5 is the best classification threshold for every classification problem. Luckily, we can change the threshold by altering the
threshold parameter of the
The following code snippet optimizes the threshold for accuracy, but you’re free to choose any other metric:
Here’s how the threshold chart looks like:
The best threshold and the corresponding obtained accuracy are shown in the plot legend. As you can see, the threshold value is more or less irrelevant for this dataset, but that likely won’t be the case for other datasets.
You can now quickly retrain the model with the optimal threshold value in mind:
Here’s the new, improved accuracy score:
And here’s the confusion matrix:
Now you know how to train a custom classifier model and how to optimize the classification threshold. Let’s compare it to a Scikit-Learn model next.
Comparison with Scikit-Learn
We want to know if our model is any good, so let’s compare it with something we know works well — a
LogisticRegression class from Scikit-Learn.
You can use the following snippet to import the model class, train the model, make predictions, and print accuracy and confusion matrix:
Here’s the obtained accuracy score:
And here’s the confusion matrix:
As you can see, the model from Scikit-Learn performs roughly the same, at least accuracy-wise. There are some tradeoffs between false positives and false negatives, but in general, both models perform well.
Let’s wrap things up in the next section.
Today you’ve learned how to implement logistic regression in Python entirely from scratch. Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.
Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit and predict data scientist.
Thanks for reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.
- Master Machine Learning: Simple Linear Regression From Scratch With Python
- Master Machine Learning: Multiple Linear Regression From Scratch With Python
- PyTorch + SHAP = Explainable Convolutional Neural Networks
- 3 Ways to Tune Hyperparameters of Machine Learning Models with Python
- Python Parallelism: Essential Guide to Speeding up Your Python Code in Minutes
The post Master Machine Learning: Logistic Regression From Scratch With Python appeared first on Better Data Science.