How to implement K-means clustering in R

In this recipe, we will implement K-means clustering, an unsupervised learning algorithm, in R using a worked example.
Last Updated: 11 Apr 2023

Recipe Objective: How to implement K-means clustering in R?

K-means clustering is an unsupervised learning algorithm. It is centroid-based, which means that each cluster is represented by its own centroid (the mean of the points assigned to it). The goal of the algorithm is to minimize the sum of squared distances between the data points and the centroid of the cluster they belong to. The steps to implement K-means clustering in R are as follows:

Step 1: Load the required packages
Step 2: Load the dataset
Step 3: Check the structure of the dataset
Step 4: Remove the y-label
Step 5: Set seed
Step 6: Model fitting
Step 7: Confusion matrix
Step 8: Plot the clusters
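
To make the centroid idea concrete before starting the steps, here is a minimal sketch of a Lloyd-style iteration: assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until the centroids stop moving. This is an illustration only. kmeans() in base R uses the Hartigan-Wong algorithm by default, and the starting rows (1, 51, 101) are an arbitrary assumption chosen so that the example is deterministic.

```r
# Illustrative Lloyd-style k-means loop (not what kmeans() runs by default)
x <- as.matrix(iris[, 1:4])
k <- 3
centers <- x[c(1, 51, 101), ]  # assumed starting centroids, picked for determinism

for (iter in 1:20) {
  # squared Euclidean distance from every point to every centroid (150 x k)
  d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # assignment step: index of the nearest centroid for each point
  assignment <- max.col(-d2, ties.method = "first")
  # update step: each centroid becomes the mean of its assigned points
  new_centers <- t(sapply(1:k, function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))
  if (all(abs(new_centers - centers) < 1e-8)) break  # centroids stopped moving
  centers <- new_centers
}

table(assignment)  # cluster sizes found by this sketch
```

The sketch does not handle empty clusters or multiple random starts, which is one reason to prefer kmeans() in practice.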

Step 1: Load the required packages

# loading required packages
library(ClusterR)
library(cluster)

Step 2: Load the dataset

We will use the iris data frame, which is built into R. It gives the measurements in centimeters of sepal length, sepal width, petal length, and petal width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.

# loading the dataset
data(iris)

Step 3: Check the structure of the dataset

# checking the structure of the dataset
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

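All four independent variables are numeric, and Species, the label column, is a factor with three levels (one per species). Because K-means is unsupervised, this label is not used to fit the model; it is only used later to check the clusters.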

Step 4: Remove the y-label

# removing the species label from the dataset
df <- iris[, -5]
head(df)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

Step 5: Set seed

K-means starts from randomly chosen initial centroids, so results can vary between runs. Setting a seed ensures that you get the same result each time you run the same process with that seed.

# setting seed
set.seed(123)
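
As a quick check of what the seed buys you, the sketch below fits the same model twice after resetting the seed and compares the assignments. The names km_a, km_b, and km_multi are illustrative and not part of the recipe. The last call uses the nstart argument of kmeans(), which tries several random starts and keeps the best one; it is shown only as an option, since the recipe itself uses a single start.

```r
df <- iris[, -5]

# same seed before each fit, so both fits start from the same random centroids
set.seed(123)
km_a <- kmeans(df, centers = 3)
set.seed(123)
km_b <- kmeans(df, centers = 3)

identical(km_a$cluster, km_b$cluster)  # TRUE

# optional: several random starts, keeping the fit with the lowest
# total within-cluster sum of squares
set.seed(123)
km_multi <- kmeans(df, centers = 3, nstart = 25)
km_multi$tot.withinss
```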

Step 6: Model fitting

# fitting the k-means clustering model
km <- kmeans(df, centers = 3)
km

K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
[106] 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
[141] 3 3 2 3 3 3 2 3 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"

The model has created 3 clusters of sizes 50, 62, and 38. To see which cluster each observation was assigned to, run:

# checking the cluster assignment of each observation
km$cluster
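
The "Within cluster sum of squares" figures in the printed model can be reproduced by hand from km$centers and km$cluster, which also shows what K-means is minimizing. The following is a sketch; the variable names (assigned_centers, sq_dist, and so on) are illustrative and not part of the recipe.

```r
df <- iris[, -5]
set.seed(123)
km <- kmeans(df, centers = 3)

# centroid of the cluster each observation was assigned to (150 x 4)
assigned_centers <- km$centers[km$cluster, ]

# squared Euclidean distance from each observation to its own centroid
sq_dist <- rowSums((as.matrix(df) - assigned_centers)^2)

wss_by_cluster <- tapply(sq_dist, km$cluster, sum)  # compare with km$withinss
tot_wss <- sum(sq_dist)                             # compare with km$tot.withinss

# total sum of squares around the overall mean, compare with km$totss
total_ss <- sum(scale(df, scale = FALSE)^2)

all.equal(as.numeric(wss_by_cluster), km$withinss)
all.equal(tot_wss, km$tot.withinss)

# the between_SS / total_SS ratio reported in the printed output
(total_ss - tot_wss) / total_ss
```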

Step 7: Confusion matrix

# confusion matrix: true species (rows) against assigned cluster (columns)
cm <- table(iris$Species, km$cluster)
cm

              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

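All 50 setosa observations fall in a single cluster (cluster 1). Cluster 2 is mostly versicolor and cluster 3 is mostly virginica, although the cluster numbers themselves are arbitrary labels. Under that matching, 2 versicolor observations were placed in the virginica-dominated cluster, and 14 virginica observations were placed in the versicolor-dominated cluster.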
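
The table can be condensed into a single agreement figure by matching each cluster to its most frequent species. The sketch below re-enters the counts from the table above by hand so that it does not depend on a particular run; dominant and agreement are illustrative names.

```r
# counts copied from the table above (rows: species, columns: clusters)
cm <- matrix(c(50,  0,  0,
                0, 48, 14,
                0,  2, 36),
             nrow = 3,
             dimnames = list(c("setosa", "versicolor", "virginica"),
                             c("1", "2", "3")))

# species that dominates each cluster
dominant <- rownames(cm)[apply(cm, 2, which.max)]
dominant

# share of observations that fall in the cluster dominated by their species
agreement <- sum(apply(cm, 2, max)) / sum(cm)
agreement  # 134 / 150, about 0.893
```

On a live fit, replacing the hand-entered matrix with table(iris$Species, km$cluster) should give the same kind of summary, even if the cluster numbers come out in a different order.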

Step 8: Plot the clusters

# visualization: sepal measurements colored by cluster
plot(df[c("Sepal.Length", "Sepal.Width")], col = km$cluster, main = "K-means Clustering")

# visualizing clusters with clusplot() from the cluster package
clust <- km$cluster
clusplot(df[, c("Sepal.Length", "Sepal.Width")], clust,
         color = TRUE, shade = TRUE, labels = 2,
         main = "Clusters of iris dataset using k-means",
         xlab = 'Sepal Length', ylab = 'Sepal Width')
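
Since K-means is centroid-based, it can help to draw the fitted centroids on top of the scatter plot. This is an optional sketch using base graphics; the plotting symbol and sizes are arbitrary choices.

```r
df <- iris[, -5]
set.seed(123)
km <- kmeans(df, centers = 3)

# scatter plot of the two sepal measurements, colored by cluster
plot(df[c("Sepal.Length", "Sepal.Width")], col = km$cluster, pch = 19,
     main = "K-means Clustering with centroids")

# overlay the three cluster centroids as large stars
points(km$centers[, c("Sepal.Length", "Sepal.Width")],
       pch = 8, cex = 2, lwd = 2)
```

Only two of the four variables are shown, so clusters that overlap in this view may still be separated along the petal measurements.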
