K Nearest Neighbors is one of the simplest, if not the simplest, machine learning algorithms. It is a classification algorithm that makes predictions based on the labels of a defined number of nearest instances.
Today you’ll get your hands dirty by implementing and tweaking the K nearest neighbors algorithm from scratch. This is the fourth of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more. The links to the previous articles are located at the end of this piece.
The article is structured as follows:
- Introduction to K Nearest Neighbors
- Math Behind K Nearest Neighbors
- From-Scratch Implementation
- K Optimization
- Comparison with Scikit-Learn
You can download the corresponding notebook here.
Introduction to K Nearest Neighbors
As said earlier, K Nearest Neighbors is one of the simplest machine learning algorithms to implement. Its classification for a new instance is based on the target labels of K nearest instances, where K is a tunable hyperparameter.
Not only that, but K is the only mandatory hyperparameter. Changing its value can lead to an increase or decrease in model performance, depending on the data. You’ll learn how to find optimal K for any dataset today.
The strangest thing about the algorithm is that it doesn’t require any learning at all; prediction boils down to a simple distance calculation. The choice of distance metric is up to you, but the most common ones are Euclidean and Cosine distances. We’ll work with Euclidean today.
When it comes to gotchas of using this algorithm, keep in mind it’s distance-based. Because of that, it might be a good idea to standardize your data before training.
And that’s pretty much all when it comes to the theory. Let’s talk briefly about the math behind before jumping into the code.
Math Behind K Nearest Neighbors
The distance calculation boils down to a single and simple formula. You aren’t obligated to use Euclidean distance, so keep that in mind.
The distance is calculated as the square root of the sum of squared differences between corresponding elements of two vectors. The vectors have to be of the same size, of course. Here’s the formula:

$$ d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2} $$

The formula can be written in a more compact way, as shown below:

$$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$

So keep in mind that these two mean the same.
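To double-check that both forms agree, here’s a quick NumPy sketch (the vectors are made up for illustration):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# expanded form: square root of the sum of squared element-wise differences
expanded = np.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2)

# compact form: the same sum, expressed as a single vectorized operation
compact = np.sqrt(np.sum((p - q) ** 2))

print(expanded, compact)  # both evaluate to 5.0
```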
And that’s the whole logic you need to implement in Python! Let’s do that next.
From-Scratch Implementation

Let’s start with the imports. We’ll need Numpy, Pandas, and Scipy for the logic and Matplotlib for visualization:
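The original import cell isn’t reproduced here, but based on that list it presumably looks something like this (`scipy.stats` is what we’ll later use for the mode):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
```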
We’ll now declare a class called `KNN`, having the Scikit-Learn API syntax in mind. The class will have the following methods:
- `__init__(k)` – the constructor; stores the value for the number of neighbors (default is 3) and the training data, which is initially set to `None`
- `_euclidean_distance(p, q)` – implements the formula from above
- `fit(X, y)` – does basically nothing; just stores the training data on the instance
- `predict(X)` – calculates distances between every row in `X` and every row in `KNN.X_train` (available after calling `fit()`). The distances are then sorted, and only the top k are kept. The classification is then made by calculating a statistical mode.
The only reason we’re implementing the `fit()` method is that we want to have the same API as Scikit-Learn. You’re free to remove it and do everything in the constructor.
Anyhow, here’s the code for the entire class:
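The exact snippet isn’t included here; the following is a sketch reconstructed from the method descriptions above (the `keepdims` argument assumes SciPy 1.9 or newer):

```python
import numpy as np
from scipy import stats


class KNN:
    def __init__(self, k=3):
        # number of neighbors and placeholders for the training data
        self.k = k
        self.X_train = None
        self.y_train = None

    def _euclidean_distance(self, p, q):
        # square root of the sum of squared element-wise differences
        return np.sqrt(np.sum((p - q) ** 2))

    def fit(self, X, y):
        # no learning happens here - just store the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = []
        for p in X:
            # distance from the new instance to every training row
            distances = [self._euclidean_distance(p, q) for q in self.X_train]
            # indices of the k closest training rows
            k_nearest = np.argsort(distances)[: self.k]
            # majority vote over their labels (keepdims requires SciPy >= 1.9)
            label = stats.mode(self.y_train[k_nearest], keepdims=True).mode[0]
            predictions.append(label)
        return np.array(predictions)
```

Calling `fit()` is instantaneous since it only stores the arrays; all the actual work happens inside `predict()`.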
Let’s test the algorithm next. We’ll use the Breast cancer dataset from Scikit-Learn. The following code snippet loads it and makes a train/test split in an 80:20 ratio:
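A sketch of that snippet (the `random_state` value is my own choice for reproducibility, so exact results may differ from the original):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load the features and target labels
X, y = load_breast_cancer(return_X_y=True)

# 80:20 train/test split; random_state pins the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```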
Let’s “train” the model now and obtain predictions. You can use the following code snippet to do so:
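Here’s one way it could look. The `KNN` class is repeated inside the snippet in condensed form so it runs on its own, and the split parameters are assumptions on my part:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


# condensed version of the KNN class from the previous section
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train, self.y_train = X, y

    def predict(self, X):
        preds = []
        for p in X:
            # vectorized Euclidean distances to every training row
            dists = np.sqrt(((self.X_train - p) ** 2).sum(axis=1))
            k_labels = self.y_train[np.argsort(dists)[: self.k]]
            # majority vote; the labels are 0/1 integers here
            preds.append(np.bincount(k_labels).argmax())
        return np.array(preds)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KNN()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds[:10])
```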
If you were to print out `preds`, here’s what you would see:
And here are the actual classes (`y_test`):
As you can see, the arrays are quite similar but have differences in a couple of places. A simple accuracy will tell us the percentage of instances classified correctly:
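A minimal way to compute it, shown here on made-up label arrays rather than the actual predictions:

```python
import numpy as np


def accuracy(y_true, y_pred):
    # fraction of positions where the prediction matches the true label
    return np.mean(y_true == y_pred)


# quick illustration: 3 out of 4 labels match
print(accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))  # 0.75
```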
The corresponding accuracy comes out to roughly 93%.
Is 93% the best we can get? Unlikely. Let’s explore how you can tweak the value of K and find the optimal one in a range.
K Optimization

It’s highly unlikely that the default K value (3) is the best one. Luckily, it’s easy to do hyperparameter optimization for this simple algorithm. All we have to do is train the model for some number of K values and select the one where accuracy was the highest.
It’s worth mentioning that only odd numbers should be tested as K values, to avoid potential ties in the majority vote.
The following code snippet evaluates the model for every odd number between 1 and 15:
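A sketch of that evaluation loop, repeated in self-contained form (the split parameters are assumptions, so your exact accuracies may vary):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


# condensed KNN from earlier, repeated so the snippet runs on its own
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train, self.y_train = X, y

    def predict(self, X):
        preds = []
        for p in X:
            dists = np.sqrt(((self.X_train - p) ** 2).sum(axis=1))
            k_labels = self.y_train[np.argsort(dists)[: self.k]]
            preds.append(np.bincount(k_labels).argmax())
        return np.array(preds)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ks = list(range(1, 16, 2))  # odd values: 1, 3, ..., 15
accuracies = []
for k in ks:
    model = KNN(k=k)
    model.fit(X_train, y_train)
    accuracies.append(np.mean(model.predict(X_test) == y_test))

print(dict(zip(ks, accuracies)))
```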
Let’s visualize the accuracies. The following code snippet makes a plot of K values on the X-axis and their corresponding accuracies on the Y-axis. The optimal one is shown in the title:
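A plotting sketch along those lines; the evaluation loop is repeated so the snippet runs standalone (same assumed split as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


# condensed KNN, repeated so the snippet is self-contained
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train, self.y_train = X, y

    def predict(self, X):
        preds = []
        for p in X:
            dists = np.sqrt(((self.X_train - p) ** 2).sum(axis=1))
            k_labels = self.y_train[np.argsort(dists)[: self.k]]
            preds.append(np.bincount(k_labels).argmax())
        return np.array(preds)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ks = list(range(1, 16, 2))
accuracies = []
for k in ks:
    model = KNN(k=k)
    model.fit(X_train, y_train)
    accuracies.append(np.mean(model.predict(X_test) == y_test))

# highlight the best-performing K in the title
optimal_k = ks[int(np.argmax(accuracies))]

plt.figure(figsize=(10, 6))
plt.plot(ks, accuracies, marker="o")
plt.title(f"Optimal K = {optimal_k}")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.show()
```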
Here are the results: a K value of 11 seems to work best with our dataset. You can now retrain the model with this in mind (`model = KNN(k=11)`).
Let’s compare performance with the Scikit-Learn model next.
Comparison with Scikit-Learn
We want to know if our model is any good, so let’s compare it with something we know works well: the `KNeighborsClassifier` class from Scikit-Learn.
You can use the following snippet to import the model class, train the model, make predictions, and print the accuracy score:
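A sketch of that comparison, using the same assumed split as in the earlier snippets and leaving `KNeighborsClassifier` at its defaults:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# default hyperparameters (n_neighbors=5), no tuning
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_test)

acc = accuracy_score(y_test, knn_preds)
print(acc)
```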
The corresponding accuracy comes out roughly the same as our model’s, at least accuracy-wise. Keep in mind that we didn’t do any tuning here, which would probably bring the accuracy to above 98%.
And that’s all for today. Let’s wrap things up in the next section.
Conclusion

Today you’ve learned how to implement K nearest neighbors in Python entirely from scratch. Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.
Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit-and-predict data scientist.
Thanks for reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.
- Master Machine Learning: Simple Linear Regression From Scratch With Python
- Master Machine Learning: Multiple Linear Regression From Scratch With Python
- Master Machine Learning: Logistic Regression From Scratch With Python
- PyTorch + SHAP = Explainable Convolutional Neural Networks
- 3 Ways to Tune Hyperparameters of Machine Learning Models with Python
The post Master Machine Learning: K Nearest Neighbors From Scratch With Python appeared first on Better Data Science.