How to perform K means clustering in R?

This recipe helps you perform K means clustering in R

Recipe Objective

People generally prefer information organised into logical groups over unorganised data. For example, information is easier to remember when it is grouped by common characteristics.

Likewise, clustering is a machine learning technique for finding groups (clusters) of similar observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships between the n observations are found without training on a response variable. A few applications of cluster analysis are:

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

The two most commonly used clustering algorithms are:

  1. KMeans Clustering: commonly used when we have a large dataset
  2. Hierarchical Clustering: commonly used when we have a small dataset

Of the two, KMeans clustering is the simpler technique. It aims to split the dataset into K groups (clusters) and is relatively fast compared to hierarchical clustering. The algorithm works as follows:

  1. Select k points at random as the initial centroids. Ideally these centroids are as far as possible from each other, because their placement critically impacts the results.
  2. Calculate the Euclidean distance between each centroid and each point in the data space.
  3. Assign each point to the centroid it is closest to. These assignments form the early groups.
  4. Recalculate the centroids (also known as recalibrated centres) as the mean of the points within each group.
  5. Repeat steps 2, 3 and 4 until the centroids no longer move. This is when the algorithm stops.
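
The five steps above can be sketched in a few lines of R. This is only an illustration of the loop, not what the recipe uses later: simple_kmeans and the toy data are hypothetical names made up for this example, and R's own kmeans() defaults to a different variant (Hartigan-Wong).

```r
# Minimal sketch of the K-means loop described above (illustrative only).
# simple_kmeans is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k observations at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    # Step 3: assign each point to its nearest centroid
    new_assignment <- max.col(-dists, ties.method = "first")
    # Step 5: stop once the assignments (and hence the centroids) stop changing
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 4: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids, iter = iter)
}

# Toy data: two made-up groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)   # number of points in each cluster
fit$centers          # final centroids
```

Comparing squared distances gives the same nearest centroid as comparing plain Euclidean distances, so the square root is skipped.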

This recipe demonstrates KMeans clustering on a real-life Mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For data manipulation
library(tidyverse)
# For plotting the clusters with clusplot() (kmeans() itself comes from the stats package)
library(cluster)

STEP 2: Loading the Dataset

Dataset description: This is basic data about the customers visiting a supermarket mall, which can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. measured attributes:

  1. CustomerID is the customer identification number.
  2. Gender is Female and Male.
  3. Age is the age of customers.
  4. Annual Income (k) is the annual income of clients in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping center according to the customer's purchasing behavior.

# creating a dataframe customer_seg
customer_seg = read.csv('R_240_Mall_Customers.csv')
# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

To keep the demonstration of K-Means clustering simple and easy to visualise, we will consider only two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]
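
Selecting columns by position works here, but it silently picks the wrong columns if the file layout ever changes. A small sketch of the name-based alternative is shown below. The toy_seg data frame is a hypothetical stand-in built from the first few values in the glimpse() output, using the column names that read.csv() produced.

```r
# Hypothetical stand-in for customer_seg, using the first four rows shown above
toy_seg <- data.frame(
  CustomerID = 1:4,
  Gender = c("Male", "Male", "Female", "Female"),
  Age = c(19, 21, 20, 23),
  Annual.Income..k.. = c(15, 15, 16, 16),
  Spending.Score..1.100. = c(39, 81, 6, 77)
)

# Positional selection, as in the recipe
by_position <- toy_seg[3:4]

# Name-based selection picks the same columns and does not depend on their order
by_name <- toy_seg[, c("Age", "Annual.Income..k..")]

identical(by_position, by_name)
```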

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. The data must be scaled (standardised) so that the different attributes are comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: Scaling is an important pre-modelling step and should not be skipped. K-means relies on Euclidean distance, so an attribute measured on a larger scale would otherwise dominate the clustering.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
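
To see what scale() is doing, the same standardisation can be written out by hand: subtract each column's mean and divide by its standard deviation. The sketch below uses a small made-up data frame (toy, with a shortened hypothetical column name Income) rather than the full Mall dataset.

```r
# Toy data standing in for the Age and Annual Income columns
toy <- data.frame(Age = c(19, 21, 20, 23, 31),
                  Income = c(15, 15, 16, 16, 17))

# Manual standardisation: subtract the column mean, divide by the column sd
manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

# scale() performs the same calculation column by column
scaled <- scale(toy)

# The two results agree (scale() only adds some extra attributes)
all.equal(manual, scaled, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
colMeans(scaled)
apply(scaled, 2, sd)
```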

STEP 4: Performing K-Means Algorithm

We will use the kmeans() function to perform this. It belongs to the stats package, which is loaded with R by default. The two arguments used below are:

  1. x = dataset being used (mandatory input)
  2. centers = number of clusters (k) (mandatory input). We will use 3 in this case.

# setting the random seed so that the results are reproducible
set.seed(50)
# creating an object km which stores the output of the function kmeans
km = kmeans(x = customer_prep, centers = 3)
km

K-means clustering with 3 clusters of sizes 76, 62, 62

Cluster means:
         Age Annual.Income..k..
1 -0.2784359          0.9660948
2  1.2138623         -0.3553890
3 -0.8725537         -0.8288562

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 2 3 2 3 2 3 3 3 3 3 2 3 3 3 2 3 2 3 2 3 3 3 2 3 2 3 2 3 3
 [38] 3 3 3 2 3 2 3 2 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2
 [75] 2 3 2 3 3 2 2 3 2 2 3 2 2 3 3 2 2 3 2 1 3 3 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2
[112] 3 1 3 3 3 2 2 2 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 51.66597 42.19537 38.32968
 (between_SS / total_SS =  66.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
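
Because the initial centroids are chosen at random, a single run can settle on a poorer solution. One common safeguard is the nstart argument of kmeans(), which tries several random starts and keeps the best one. The sketch below uses synthetic data (toy) as a stand-in for the scaled customer data, and the value 25 is just an illustrative choice.

```r
set.seed(50)
# Synthetic stand-in data, not the Mall dataset
toy <- scale(matrix(rnorm(400), ncol = 2))

# One random start versus the best of 25 random starts
single <- kmeans(toy, centers = 3, nstart = 1)
multi  <- kmeans(toy, centers = 3, nstart = 25)

# kmeans() keeps the run with the lowest total within-cluster sum of squares,
# so the multi-start value is usually no worse than the single-start value
c(single = single$tot.withinss, multi = multi$tot.withinss)
```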
​

The most important components to note from the output of the kmeans() function are:

  1. cluster (Access line of code: km$cluster): the vector of integers indicating the assignment of each observation to a particular cluster
  2. totss (Access line of code: km$totss): returns the total sum of squares
  3. centers (Access line of code: km$centers): returns the matrix of centers
  4. withinss (Access line of code: km$withinss): returns the within-cluster sum of squares, one value per cluster
  5. tot.withinss (Access line of code: km$tot.withinss): returns the total within-cluster sum of squares, i.e. the sum of withinss
  6. size (Access line of code: km$size): returns the number of points in each cluster
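
The components listed above are related to each other, which the following sketch illustrates. It runs kmeans() on synthetic data (toy and km_toy are hypothetical names, not the Mall dataset), so the numbers will differ from the output shown earlier.

```r
set.seed(50)
# Synthetic stand-in for the scaled customer data
toy <- scale(matrix(rnorm(200), ncol = 2,
                    dimnames = list(NULL, c("Age", "Income"))))
km_toy <- kmeans(x = toy, centers = 3)

km_toy$cluster[1:10]   # cluster label of the first ten observations
km_toy$centers         # one row per cluster, one column per variable
km_toy$size            # number of points in each cluster

# The total sum of squares splits into within- and between-cluster parts
all.equal(km_toy$totss, km_toy$tot.withinss + km_toy$betweenss)
all.equal(km_toy$tot.withinss, sum(km_toy$withinss))

# Share of the total variation explained by the clustering
km_toy$betweenss / km_toy$totss
```

The same relationship can be checked against the output printed in Step 4. The three within-cluster sums of squares add up to about 132.19, and the total sum of squares for 200 scaled observations on two variables is 199 x 2 = 398, so 1 - 132.19 / 398 is roughly 0.668, which matches the reported 66.8 %.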

STEP 5: Data Visualisation using a scatter plot with clusters

We use the clusplot() function in the cluster library to plot the clusters formed w.r.t. Age and Income.

# contains the vector of integers indicating the assignment of each observation to a particular cluster
k_means = km$cluster
# using the clusplot() function with various arguments to plot the clusters
clusplot(customer_prep, k_means,
         shade = TRUE, color = TRUE, span = TRUE,
         main = 'Clusters of customers',
         xlab = 'Age', ylab = 'Annual Income')

This plot helps us analyse the different clusters of customers so that we can target each cluster separately in our marketing strategy. The optimal number of clusters can be found by the Elbow method using a scree plot.
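
A minimal sketch of the Elbow method is given below. It runs kmeans() for several values of k and plots the total within-cluster sum of squares against k; the "elbow" is the point after which adding more clusters gives only a small improvement. The data here is synthetic (toy is a made-up stand-in with three loose groups), and the range 1 to 8 and nstart = 25 are illustrative choices rather than part of the recipe.

```r
set.seed(50)
# Synthetic stand-in with three loose groups (not the Mall dataset)
toy <- scale(rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                   matrix(rnorm(60, mean = 4), ncol = 2),
                   cbind(rnorm(30, mean = 0), rnorm(30, mean = 4))))

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Scree plot: look for the point where the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow method")
```

To apply this to the recipe's data, customer_prep would take the place of toy.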
