How to implement K-means clustering in R

In this recipe, we will learn how to implement an unsupervised learning algorithm, K-means clustering, with the help of an example in R.

Recipe Objective: How to implement K-means clustering in R?

K-means clustering is an unsupervised learning algorithm. It is centroid-based, which means that each cluster is represented by its centroid (the mean of the points assigned to it). The main goal of the algorithm is to minimize the sum of squared distances between the data points and the centroid of the cluster they belong to. The steps to implement K-means clustering in R are as follows:
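Before going through the steps, the idea behind the algorithm can be sketched in a few lines of R. This is only an illustration of the assign-then-update loop under simple assumptions (random rows as starting centroids, squared Euclidean distance). `simple_kmeans` is a hypothetical helper written for this sketch, not a function from any package, and it is not how the built-in `kmeans()` used later is implemented internally.

```r
# Minimal sketch of the k-means idea (assign points, then update centroids).
# simple_kmeans is a hypothetical helper for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # assumption: start from k rows picked at random as initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # assign each point to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stopped changing
    cluster <- new_cluster
    # move each non-empty cluster's centroid to the mean of its points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  # the quantity k-means tries to make small: total within-cluster sum of squares
  wss <- sum((x - centers[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centers, tot.withinss = wss)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
fit$tot.withinss
```

The result depends on the random starting centroids, so different seeds may give different clusters. The steps below use R's built-in `kmeans()` instead.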


Step 1: Load the required packages

#loading required packages
library(ClusterR)
library(cluster) #provides clusplot(), used for the cluster plot in Step 8

Step 2: Load the dataset

We will make use of the iris data frame. iris is a built-in data frame that gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

#loading the dataset
data(iris)

Step 3: Check the structure of the dataset

#checking the structure of the dataset
str(iris)

	'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

All four independent variables are numeric, and the dependent (target) variable, Species, is a factor with three levels (the 3 species).

Step 4: Remove the y-label

#removing the species label from the dataset
df = iris[, -5]
head(df)

	  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

Step 5: Set seed

Setting a seed ensures that you get the same result each time you run the same process with that seed. This matters here because kmeans() picks its initial cluster centers at random.

#setting seed
set.seed(123)
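As a quick check of what the seed does, the sketch below fits the same model twice, resetting the seed before each fit, and compares the cluster assignments. The names `fit_a` and `fit_b` are introduced here only for illustration.

```r
# Resetting the seed before each fit gives the same random starting centers,
# so both fits should produce the same cluster assignments.
set.seed(123)
fit_a <- kmeans(iris[, -5], centers = 3)

set.seed(123)
fit_b <- kmeans(iris[, -5], centers = 3)

# compare the assignments of the two fits
identical(fit_a$cluster, fit_b$cluster)
```

Without the seed, two fits may start from different centers and may number the clusters differently or settle on a different grouping.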

Step 6: Model fitting

#fitting the k-means clustering model
km <- kmeans(df, centers = 3)
km

	K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
[106] 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
[141] 3 3 2 3 3 3 2 3 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault" 

It can be seen that the model has created 3 clusters of sizes 50, 62, and 38. To check which cluster each observation was assigned to, run:

#checking identification for each observation
km$cluster
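The other entries listed under "Available components" can be read from the fitted object in the same way. The sketch below refits the model so that it runs on its own, then shows how the within-cluster and between-cluster sums of squares relate to the percentage printed in the model summary. Exact values may differ between R versions, since the random number generator behind the starting centers has changed over time.

```r
set.seed(123)
km <- kmeans(iris[, -5], centers = 3)

km$size     # number of observations in each cluster
km$centers  # one row of variable means per cluster

# the total within-cluster sum of squares is the sum of the per-cluster values
all.equal(km$tot.withinss, sum(km$withinss))

# total variation splits into a within-cluster and a between-cluster part
all.equal(km$totss, km$tot.withinss + km$betweenss)

# the "between_SS / total_SS" percentage reported in the printed summary
round(100 * km$betweenss / km$totss, 1)
```

A higher between_SS / total_SS share means the clusters are more compact relative to how far apart they are.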


Step 7: Confusion matrix

#confusion matrix
cm <- table(iris$Species, km$cluster)
cm

              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

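It can be seen that all 50 setosa observations were placed together in cluster 1. Of the versicolor observations, 48 fall in cluster 2 and 2 in cluster 3; of the virginica observations, 36 fall in cluster 3 and 14 in cluster 2. Cluster numbers are arbitrary labels, so the table is read by matching each cluster to the species that dominates it.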
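The table can be summarized as a single agreement score by matching each cluster to its most common species and counting how many observations follow that match. For the table shown above this is (50 + 48 + 36) / 150, or about 0.893. The sketch below computes the same quantity; `nstart = 25` is an assumption added here (it is a real `kmeans()` argument that tries several random starts and keeps the best) so that the example is less sensitive to the starting centers, and `matched` and `agreement` are names introduced for illustration.

```r
set.seed(123)
# nstart = 25 is an added assumption: run 25 random starts and keep the best fit
km <- kmeans(iris[, -5], centers = 3, nstart = 25)
cm <- table(iris$Species, km$cluster)

# cluster numbers are arbitrary, so take the largest species count per cluster
matched <- apply(cm, 2, max)

# share of observations that fall in the dominant species of their cluster
agreement <- sum(matched) / sum(cm)
agreement
```

This score only makes sense because the true species are known for iris; in a real unsupervised setting there is usually no such label to compare against.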

Step 8: Plot the clusters

#visualization
plot(df[c("Sepal.Length", "Sepal.Width")],
     col = km$cluster,
     main = "K-means Clustering")

#visualizing clusters
clust <- km$cluster
#use df, the data frame created in Step 4
clusplot(df[, c("Sepal.Length", "Sepal.Width")],
         clust,
         color = TRUE,
         shade = TRUE,
         labels = 2,
         main = "Clusters of iris dataset using k-means",
         xlab = 'Sepal Length',
         ylab = 'Sepal Width')
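Since the algorithm is centroid-based, it can also help to mark the cluster centers on the scatter plot. The sketch below is one possible way to do that with base graphics; it refits the model so that it runs on its own, and the plotting symbol and size are arbitrary choices.

```r
set.seed(123)
df <- iris[, -5]
km <- kmeans(df, centers = 3)

# scatter plot of the two sepal variables, colored by assigned cluster
plot(df[c("Sepal.Length", "Sepal.Width")],
     col = km$cluster,
     main = "K-means clustering with centroids")

# overlay each cluster's center, using the same color as its cluster
points(km$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)
```

The plot shows only two of the four variables, so clusters that overlap here may still be separated along the petal measurements.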
