In this article, we are going to understand the absolute fundamentals of K-Nearest Neighbours (K-NN), along with the intuition behind it.
What is K-Nearest Neighbours?
K-NN, or K-Nearest Neighbours, is a supervised learning algorithm that can be used for Classification and Regression problems.
What is classification? Suppose you have 10 bottles of wine, and you are supposed to classify them into red and white. This can be done using software, and is known as a Classification Problem.
What is regression? Simply put, regression is a technique to predict a numeric value. The prediction could be the price of a house, or the quantity of units being sold today on Amazon. In simple words, regression in Machine Learning means predicting a number rather than a category. Don’t let these gurus confuse you!
The term K-Nearest Neighbours means the following:
1. K – An integer value that we choose to tell the algorithm how many neighbours to look at
2. Nearest Neighbours – We use the value of K from above, say for example K=5, and ask our algorithm to tell us the 5 Nearest Neighbours to a certain data point with coordinates, say, (X,Y)
K-NN Classification: A strong intuition
To begin with, it is important that we touch on the topic of K-NN classification. Once the intuition is understood, we will slowly move over to K-NN regression in another post.
As I always prefer, we start with the intuition behind the algorithm, which is absolutely essential. Let’s take a look at the image below.
In the above graph we see two categories of glasses, Green and Blue. The glasses in Blue do not break when dropped, but the glasses in Green do break!
Now the question is: what if I have one glass with a specific hardness, and it is dropped from a certain height? We would like to predict whether this glass will break or not. Take a look at the image below.
Here, the point on the scatter plot with a purple outline is the glass to be categorised as one which will break or not. Simple? Maybe yes, maybe no. Let’s find out together 🙂
The process behind K-NN
The process is simple, and memorising it will only help you grasp the logic better. Four main steps for you. Here they are!
The problem in question (whether a glass of a specific hardness, if dropped from a specific height, will break or not) is placed as a point into the scatter plot with the other glasses.
The distance is measured from this point to ALL other points on the scatter plot, using the Euclidean distance formula.
If the points (x1,y1) and (x2,y2) are in 2-dimensional space, then the Euclidean distance between them is d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
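To make the distance step concrete, here is a minimal Python sketch of the 2-dimensional Euclidean distance. The function name `euclidean_distance` and the sample coordinates are made up for illustration; they are not from the original example.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points p = (x1, y1) and q = (x2, y2)."""
    (x1, y1), (x2, y2) = p, q
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# Two hypothetical glasses described as (hardness, drop height)
print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0, since sqrt(3^2 + 4^2) = 5
```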
All the distances are first sorted in ascending order.
We then choose K = 5, which is a number we pick as part of the algorithm. It tells the algorithm to find the K nearest neighbours of the PURPLE point. In the diagram on the right we have marked the SHORTEST FIVE DISTANCES in RED, since the value of K we have chosen is 5.
In the image below, I have illustrated the most important steps that come next:
1. The distances are sorted in ascending order, and the five shortest distances are chosen since we have selected K=5.
2. The algorithm then checks which class appears most often among these five neighbours. In this case it is Green.
3. Since the highest frequency is Green, the algorithm decides: “I think that the glass will break, since the highest frequency is Green!”
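The whole process (measure the distances, sort them, keep the K nearest, take a vote) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `knn_classify` and the eight sample glasses, given as (hardness, drop height), are invented for the example and are not the data behind the images.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Measure the Euclidean distance from the query to every known point
    distances = [
        (math.hypot(query[0] - p[0], query[1] - p[1]), label)
        for p, label in zip(points, labels)
    ]
    # Sort the distances in ascending order
    distances.sort(key=lambda pair: pair[0])
    # Keep the labels of the k nearest neighbours and count them
    votes = Counter(label for _, label in distances[:k])
    # The most frequent label wins (a tie goes to the label counted first)
    return votes.most_common(1)[0][0]

# Made-up glasses: Green = breaks, Blue = does not break
points = [(2.0, 8.0), (3.0, 9.0), (2.5, 7.5), (3.5, 8.5),
          (8.0, 2.0), (7.5, 3.0), (9.0, 2.5), (8.5, 1.5)]
labels = ["Green", "Green", "Green", "Green",
          "Blue", "Blue", "Blue", "Blue"]

print(knn_classify((4.0, 7.0), points, labels, k=5))  # Green
```

With K=5, the new glass at (4.0, 7.0) has four Green neighbours and one Blue neighbour, so the vote goes to Green.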
How do I choose the right value of K?
To select the appropriate value of K, we have to run the K-NN algorithm multiple times, with multiple values of K. Whichever value of K gives us the best accuracy and/or the minimum error on data held back for testing is the right choice of K. 🙂
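1. With decreasing value of K – predictions become less stable, because one or two noisy neighbours can swing the vote. Example: K=3
2. With increasing value of K – predictions become more stable, up to a point, because more neighbours take part in the vote. If K gets too large, far-away points start to outvote the close ones and accuracy drops.
3. When there are two classes, keep the value of K as an odd number. This avoids tied votes.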
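One way to try several values of K is sketched below. It uses leave-one-out testing, which is an assumption of this sketch and not something the steps above prescribe: each glass is hidden in turn, predicted from the remaining glasses, and the share of correct predictions is the accuracy for that K. The helper names and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Majority vote among the k points closest to `query`."""
    ranked = sorted(
        zip(points, labels),
        key=lambda pl: math.hypot(query[0] - pl[0][0], query[1] - pl[0][1]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Hide each point in turn, predict it from the rest, return the share correct."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_classify(points[i], rest_points, rest_labels, k) == labels[i]:
            correct += 1
    return correct / len(points)

# Made-up glasses as (hardness, drop height): Green = breaks, Blue = does not break
points = [(2.0, 8.0), (3.0, 9.0), (2.5, 7.5), (3.5, 8.5),
          (8.0, 2.0), (7.5, 3.0), (9.0, 2.5), (8.5, 1.5)]
labels = ["Green"] * 4 + ["Blue"] * 4

for k in (1, 3, 5, 7):
    print(k, leave_one_out_accuracy(points, labels, k))
```

On this toy data K=1, 3 and 5 all score 1.0, while K=7 scores 0.0: with one glass hidden, only three glasses of its own class remain, so the four glasses of the other class always win the vote. A larger K is therefore not automatically better.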
K-NN is among the simplest classification algorithms out there. It is important to begin with such simple classification algorithms to understand the absolute basics of Machine Learning. I hope you enjoyed reading this, and in case you have any questions, please ask them in the comments below.