
K-Nearest Neighbours Simplified: From Math Theory to Practical Usage

Understand the K-Nearest Neighbours algorithm with ease as we break down its theory and practical usage in simple terms. Ideal for beginners exploring machine learning fundamentals.
May 28, 2025
12 min read

Imagine you just moved into a new neighbourhood. You’re trying to figure out which pizza place delivers fastest. What do you do? You ask your three closest neighbours what they’ve experienced, and you go with the majority. Boom—you’ve just used the logic behind K-Nearest Neighbours (KNN).


What is K-Nearest Neighbours (KNN)?

KNN is a supervised machine learning algorithm that classifies data points based on their similarity to other points. In regression tasks, it predicts values by averaging the outcomes of the nearest neighbours.

Analogy: Think of KNN like seeking advice from your closest friends. If you're unsure about a decision, you consult a few friends whose opinions you trust (your "nearest neighbours"). Their collective input helps you make a choice.

A graphical example of K-Nearest Neighbours

Imagine a scatter plot with different coloured dots representing various categories. When a new data point appears, KNN identifies the 'k' closest points and assigns the most common category among them to the new point.

What is K?

In K-Nearest Neighbours, K is just a number. It tells the algorithm how many nearby neighbours to look at when making a prediction.

  • For classification, it’s like asking, “What’s the most popular vote among my K neighbours?”
  • For regression, it means, “What’s the average value among my K neighbours?”

So, picking the right K is super important—because it decides how your model behaves.

Why K Matters

Choosing K = 1 is like being a follower of one person. You copy their answer exactly.

Choosing K = 100 is like asking a huge crowd and averaging their answers, even if many aren’t very similar to you.

So what’s the sweet spot?

The Goldilocks Problem

Think of it like porridge:

  • K too small (like 1 or 2): Too hot. Your model is too sensitive. It might overreact to noisy data or outliers.

Example: If your only neighbour happens to be a weirdo, you’ll end up with a weird answer.

  • K too large (like 100): Too cold. Your model gets too generic. It might mix in neighbours that aren't even close in behaviour.

Example: Asking 100 people for your house price, including folks who live in a different city.

  • K just right: Balanced. Close enough to care, but not so few that one oddball ruins the answer.

So, What Value of K Should I Use?

There’s no one-size-fits-all answer, but here are some general rules:

Rule of Thumb:

A good starting point is:

K ≈ √N

Where N is the total number of data points you have.

So if you have 100 data points, try K = 10.

It’s not perfect, but it gets you in the ballpark.
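
In code, that heuristic is a one-liner (a quick sketch; X_train here stands in for whatever training set you're using):

import math

# Rule-of-thumb starting point: K ~ square root of the number of training samples
k = max(1, round(math.sqrt(len(X_train))))
if k % 2 == 0:  # nudge to an odd number to avoid tied votes in classification
    k += 1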

Using Cross-Validation

“Cross-validation” is a fancy way of saying:

“Let me test different values of K and see which one gives me the best results on unseen data.”

Try K = 1 to 20, and for each one:

  • Train your KNN model
  • Test how well it predicts
  • Pick the K with the best score

You can do this in Python with just a few lines of code using GridSearchCV from sklearn.
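
Here's what that search looks like with scikit-learn's GridSearchCV (a minimal sketch, assuming X_train and y_train already hold your training data):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try every K from 1 to 20, scoring each with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 21))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)  # e.g. {'n_neighbors': 7}
print(search.best_score_)   # mean cross-validated accuracy for that K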

Even K vs Odd K

Use an odd number for K when doing classification, especially if your classes are binary (like “yes” or “no”). This avoids tie votes.

  • K = 3 → Neighbours vote: 2 yes, 1 no → You go with yes
  • K = 4 → Could be 2 yes, 2 no → Awkward

Clean Data = Better K

Keep in mind:

  • Normalize your data (so one feature doesn’t dominate)
  • Remove outliers if possible
  • Test K on realistic data, not made-up toy data

KNN Regression

In KNN regression, instead of predicting a category (like “cat” or “dog”), we’re predicting a number.

Think:

“How much will this house sell for?”
“What will the temperature be tomorrow?”
“How many likes will this post get?”

KNN helps answer these by looking at the closest examples from the past, and averaging their values.
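
For instance, here's a tiny regression sketch with scikit-learn, using made-up house sizes and prices purely for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: house size in square metres -> sale price
X_train = np.array([[50], [60], [80], [100], [120], [150]])
y_train = np.array([150_000, 170_000, 210_000, 260_000, 300_000, 380_000])

# Predict the price of a 95 m² house by averaging its 3 nearest neighbours
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[95]]))  # the mean of the 80, 100 and 120 m² prices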

How Does KNN Work?

Let's break down the KNN algorithm into manageable steps:

  1. Choose the value of K: Decide how many neighbours to consider. A common practice is to use an odd number to avoid ties.
  2. Calculate distances: Measure the distance between the new data point and every existing point in the dataset. The Euclidean distance is commonly used: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
  3. Identify nearest neighbours: Select the 'k' points closest to the new data point.
  4. Classify or predict: For classification, assign the most frequent category among the neighbours. For regression, calculate the average value.
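
To make those four steps concrete, here's a minimal from-scratch sketch in plain NumPy (a toy illustration, not a production implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels (classification)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example: two features, two classes (Step 1: we chose k=3)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0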

Math Behind KNN Simplified

At its core, KNN relies on distance metrics to determine similarity:

  • Euclidean Distance: Measures the straight-line distance between two points.
  • Manhattan Distance: Calculates the distance between two points by only moving horizontally and vertically.
  • Minkowski Distance: A generalization of both Euclidean and Manhattan distances.

A graphical example of the math behind KNN
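
For example, here's how the three metrics compare for the same pair of points, computed with NumPy (a quick illustrative sketch):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))                  # horizontal + vertical steps: 7.0
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # generalisation: p=1 is Manhattan, p=2 is Euclidean

print(euclidean, manhattan, minkowski)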

Strengths and Weaknesses of KNN 

Strengths:

  • Easy to understand and implement
  • No training phase; it's a lazy learner
  • Effective with small datasets and non-linear relationships

Weaknesses:

  • Computationally intensive with large datasets
  • Sensitive to irrelevant features and outliers
  • Performance depends on the choice of 'k'

Practical Use Cases of KNN 

You’ve got the theory. You’ve seen the math. But let’s be real—none of that matters much if you don’t know where or why you’d actually use K-Nearest Neighbours in the real world. The good news? KNN is like that one friend who may not be flashy, but always shows up when it counts.

Let’s look at some areas where KNN really shines:

1. Recommendation Systems (a.k.a. “If You Liked That, You’ll Love This”)

Ever wondered how Netflix suggests what you might want to watch next? Or how Amazon knows you might want running shoes after buying a yoga mat?

KNN can help make these kinds of predictions by looking at what other users—your “nearest neighbours”—have liked, bought, or rated highly. If you and another person have a similar history, KNN assumes you might like the same things too. It’s like word-of-mouth, powered by math.

Why KNN works well here: It doesn’t need a complex model—just good old similarity-based logic.

2. Medical Diagnosis (Pattern Recognition, But Smarter)

In healthcare, identifying whether a tumour is benign or malignant based on test data can be a life-saving classification task. KNN can compare a new patient’s test results to past data, and if most “neighbouring” cases were benign, it assumes the new one probably is too.

Why KNN works well here: Medical data often follows known patterns. KNN picks up on those without needing assumptions about the data’s distribution.

3. Credit Scoring and Fraud Detection (Friend or Fraud?)

Banks and credit card companies use KNN to assess whether a new customer is likely to repay a loan—or whether a sudden $1,000 charge in a foreign country is legit. KNN helps by comparing current data with past behaviour patterns. If it looks nothing like your regular activity, it raises a red flag.

Why KNN works well here: It’s sensitive to anomalies. KNN notices when something just doesn’t “fit in.”

4. Sports and Player Stats (Finding the MVPs)

In sports analytics, KNN can be used to group athletes with similar performance metrics. For example, if you want to find players who perform like LeBron James based on points, rebounds, and assists, KNN will find the closest statistical matches.

Why KNN works well here: It’s great at clustering similar performance profiles—even when the data isn’t linear.

5. Image Recognition (Spot the Difference)

KNN is surprisingly handy in computer vision tasks, especially when classifying simple images like handwritten digits (think: ZIP code scanners at the post office). It compares the pixel patterns of new images to a set of known examples and assigns the label of the closest match.

Why KNN works well here: It’s intuitive and doesn’t require feature engineering or model training—just clean, labelled data.

KNN in Python

KNeighborsClassifier (for classification)

Here is the syntax for KNN Classification in Python

from sklearn.neighbors import KNeighborsClassifier

# Instantiate the model
knn_classifier = KNeighborsClassifier(n_neighbors=5)  # 5 is the default

# Fit the model to training data
knn_classifier.fit(X_train, y_train)

# Predict on new/test data
y_pred = knn_classifier.predict(X_test)


Optional parameters you can set:

  • n_neighbors: number of neighbours to use (default: 5)
  • weights: 'uniform' (default) or 'distance'
  • metric: distance metric (default: 'minkowski' with p=2, which is equivalent to Euclidean distance)
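
For example, a classifier that weights closer neighbours more heavily and uses plain Euclidean distance could be configured like this (a sketch of those parameters in use):

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=7, weights='distance', metric='euclidean')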

KNeighborsRegressor (for regression)

from sklearn.neighbors import KNeighborsRegressor

# Instantiate the model
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Fit the model to training data
knn_regressor.fit(X_train, y_train)

# Predict on new/test data
y_pred = knn_regressor.predict(X_test)

The same optional parameters apply here as well.

KNN Using the Iris Dataset

Let's build a basic KNN classification model using the Iris dataset, one of the most popular beginner datasets in machine learning.

Step 1: Import Required Libraries

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

  • load_iris: gets our dataset
  • train_test_split: splits our data into training and test sets
  • StandardScaler: normalizes features so one doesn't overpower the others
  • KNeighborsClassifier: the actual KNN model
  • accuracy_score, classification_report, confusion_matrix: for evaluation

Step 2: Load the Iris Dataset

iris = load_iris()
X = iris.data  # Features: sepal and petal length/width
y = iris.target  # Labels: 0 = setosa, 1 = versicolor, 2 = virginica

Step 3: Split into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We’re using 80% of the data for training, 20% for testing.

Step 4: Scale the Features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Scaling helps KNN because it depends on distances between data points. Without it, a feature like "petal length" could outweigh "sepal width" simply because its values are larger.

Step 5: Fit the KNN Model

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

Here we choose K = 3 (i.e., it looks at the 3 closest neighbours to make a prediction).

Step 6: Make Predictions

y_pred = knn.predict(X_test_scaled)

This uses the trained model to predict the species of flowers in our test set.

Step 7: Evaluate the Model

# Basic accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# More detailed performance
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Example Output (What You Might See)

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion Matrix:
[[ 9  0  0]
 [ 0 11  0]
 [ 0  0 10]]

Perfect score! (Note: This isn’t always the case. Real-world data is usually messier.)

Optional Tweaks

  • Try different values of K (e.g., 5, 7, 9)
  • Use GridSearchCV to find the best K
  • Plot a confusion matrix using seaborn for better visuals
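
As an example of that last tweak, here's a quick heatmap of the confusion matrix (a sketch that reuses y_test, y_pred and iris from the steps above):

import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)  # confusion_matrix was imported in Step 1
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()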

Advantages and Disadvantages of KNN

Pros – Why People Love KNN

1. Simple and Intuitive

KNN is one of the easiest-to-understand machine learning algorithms out there.

Imagine you're in a new city, trying to find a good restaurant. You ask the locals where they eat. That’s basically KNN—you trust the “nearest neighbours” to guide your choice.

  • No need to understand complex math.
  • No fancy model training.
  • Just compare data points and see what the neighbours are doing.

2. No Training Time (Lazy Learning)

KNN is called a lazy algorithm, not because it’s sloppy—but because it doesn’t learn a model in advance.

  • It just stores the data.
  • When it needs to make a prediction, it looks at the existing data and does some distance math on the spot.

This means:

  • Zero training time
  • Ideal if you need fast setup and don’t want to build a model up front

3. Versatile (Classification or Regression)

KNN can do both classification (is it a cat or a dog?) and regression (how much will the house cost?).

All it changes is:

  • Classification = majority vote from neighbours
  • Regression = average value from neighbours

4. Good with Multi-Class Problems

Many algorithms struggle when you have more than two categories to choose from. Not KNN. It handles multi-class classification (like identifying different flower species) with no extra effort.

Cons: Where KNN Struggles

1. Slow Prediction on Large Datasets

Because KNN doesn't build a model up front, it does all the work at prediction time.

Every time you ask a question, it looks through the entire dataset to find the nearest neighbours.

On small datasets? Not a big deal.

But on huge datasets (millions of rows)? KNN gets painfully slow, especially if used in real-time applications.

2. Sensitive to Irrelevant Features

KNN bases its decisions on distance. If your data has irrelevant features (like someone's zip code when predicting their favourite snack), it can mess things up.

Think of it like judging how similar two people are by including random info like the number of keys on their keychain.

Solution: Carefully select features or use feature selection techniques.


3. Sensitive to Feature Scaling

Let’s say your dataset has:

  • Age (ranging from 18 to 80)
  • Income (ranging from $20,000 to $500,000)

The larger numbers (income) will dominate the distance calculations—even if they aren’t more important.

That’s why scaling your features (like using StandardScaler or MinMaxScaler) is essential in KNN.
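
A quick illustration with two hypothetical customers shows the effect (a sketch):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical customers: [age, income]
X = np.array([[25.0, 40_000.0],
              [30.0, 300_000.0]])

# Raw distance is almost entirely driven by income
print(np.linalg.norm(X[0] - X[1]))  # ~260000

# After standardising, age and income contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~2.83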

4. Struggles with High-Dimensional Data (a.k.a. “Curse of Dimensionality”)

When you have too many features (dimensions), the idea of “closeness” starts to break down.

In high dimensions, every point starts to look kind of far away from every other point.

This makes KNN less effective, and sometimes downright confusing, in high-dimensional datasets.

Fix: Use dimensionality reduction (like PCA) to simplify the data before applying KNN.
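
One common way to do that is to chain the scaler, PCA and KNN in a scikit-learn Pipeline (a minimal sketch; the number of components is something you'd tune for your own data):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('scale', StandardScaler()),            # put features on a common scale first
    ('pca', PCA(n_components=10)),          # keep the 10 strongest directions (tune this)
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)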

When Should You Use KNN?

  • Use KNN when:
    • You have a small to medium-sized dataset.
    • The relationship between data points is clear and based on proximity.
    • You do not need to make strong assumptions about data distribution.
    • You can manage computational complexity.
  • Avoid KNN when:
    • You are dealing with large datasets or need real-time predictions.
    • The data has high dimensionality or noise.
    • The dataset is imbalanced or contains outliers.
    • Feature scaling is an issue, and normalisation is not possible or practical.

By considering these factors, you can decide when KNN is the best fit and when another algorithm might be more appropriate.

Conclusion

In conclusion, K-Nearest Neighbours (KNN) is a simple, easy-to-understand algorithm that can be really powerful for both classification and regression tasks. It’s like having a trusty neighbour to help you make decisions based on the people closest to you. However, as with any tool, it’s important to know when to use it and when it might not be the best choice, especially for large datasets or when there’s a lot of noise in your data.

If you're ready to take your data science skills to the next level and learn more about algorithms like KNN (and many others), why not check out Skillcamper’s courses? Whether you’re just starting out or looking to refine your skills, Skillcamper offers practical, hands-on learning that can help you master the tools you need to succeed in the world of data science. So go ahead, dive in, and start learning today!

