How to perform hierarchical clustering in R?

This recipe helps you perform hierarchical clustering in R

Recipe Objective

People prefer organised, logical groups of information over unorganised data. For example, it is easier to remember information when it is clustered together based on common characteristics.

Likewise, Clustering is a machine learning technique that provides a way to find groups/clusters of different observations within a dataset. Due to the absence of a response variable, it is considered an unsupervised method. This implies that the relationships between 'n' observations are found without being trained by a response variable. A few applications of Clustering analysis are:

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

The two most commonly used Clustering algorithms are:

  1. KMeans Clustering: commonly used when we have a large dataset
  2. Hierarchical Clustering: commonly used when we have a small dataset

Hierarchical Clustering is an unsupervised machine learning technique that aims to group an unlabeled dataset by building a hierarchy of clusters. It is relatively slow compared to KMeans Clustering. There are two types of Hierarchical clustering algorithms: Divisive (top-down approach) and Agglomerative (bottom-up approach).

The most commonly used is the agglomerative algorithm. Initially, the data is split into n clusters, where n is the number of observations in the dataset. Next, the euclidean distances between the data points are calculated. Then, the number of clusters is reduced iteratively by merging the two closest clusters into one, based on the distances between them. This process of merging stops when only one cluster remains. The hierarchy of clusters is represented by a Dendrogram (a tree-like structure).
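This merge-by-merge behaviour can be sketched on a tiny hypothetical dataset (four 1-D points, not the Mall data) using base R's dist() and hclust():

```r
# Four hypothetical 1-D points: {1, 2} and {9, 10} form two obvious groups
x <- matrix(c(1, 2, 9, 10), ncol = 1)

# Pairwise euclidean distances, then agglomerative clustering (Ward's method)
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")

# hc$merge records which clusters were joined at each iteration,
# hc$height records the distance at which each merge happened
print(hc$merge)
print(hc$height)

# Cutting the tree into 2 clusters separates {1, 2} from {9, 10}
print(cutree(hc, 2))
```

With n = 4 observations, exactly n - 1 = 3 merges happen before a single cluster remains, which is why hc$merge has three rows.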

This recipe demonstrates Hierarchical Clustering using the Agglomerative algorithm on a real-life Mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For Clustering algorithm
library(cluster)

STEP 2: Loading the Dataset

Dataset description: It contains basic data about customers visiting a supermarket mall and can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. the following attributes:

  1. CustomerID is the customer identification number.
  2. Gender is the gender of the customer (Female or Male).
  3. Age is the age of the customer.
  4. Annual Income (k$) is the annual income of the client in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping center according to the customer's purchasing behavior.

# creating a dataframe customer_seg
customer_seg = read.csv('R_241_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

For the simplicity of demonstrating Hierarchical Clustering with visualisation, we will only consider two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. In this step, the data must be scaled or standardised so that different attributes are comparable. Standardised data has a mean of zero and a standard deviation of one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
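As a quick sanity check (on a small hypothetical vector, not the full dataset), scale() is equivalent to subtracting the column mean and dividing by the column standard deviation:

```r
# A small hypothetical sample of ages
age <- c(19, 21, 20, 23, 31, 22)

# Standardise with scale() ...
scaled_builtin <- as.numeric(scale(age))

# ... and manually: (x - mean) / sd
scaled_manual <- (age - mean(age)) / sd(age)

# Both approaches agree, and the standardised values
# have mean 0 and standard deviation 1
print(all.equal(scaled_builtin, scaled_manual))
print(mean(scaled_builtin))
print(sd(scaled_builtin))
```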

STEP 4: Calculating the distances between the observations/datapoints

We use the dist() function to carry out this task. It calculates the euclidean distances between the data points. We create an object 'distances' to store this information.

distances = dist(customer_prep, method = 'euclidean')
distances %>% head()

0.143174096235284 0.0810822191043143 0.288868323164455 0.862412934253182 0.227861432257651 1.15107392658714
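The first value above is the distance between customers 1 and 2. Plugging their scaled values from STEP 3 into the euclidean formula reproduces it:

```r
# Scaled (Age, Annual Income) for the first two customers,
# taken from the output of STEP 3 above
p1 <- c(-1.4210029, -1.734646)
p2 <- c(-1.2778288, -1.734646)

# Euclidean distance: square root of the sum of squared differences
d12 <- sqrt(sum((p1 - p2)^2))
print(d12)  # ~0.1431741, matching the first value of 'distances'
```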

STEP 5: Performing Hierarchical Clustering

We will use the hclust() function (available in base R's stats package) to perform this. The two arguments used below are:

  1. the distances between the points
  2. the method of evaluation: Ward's method ('ward.D')

# setting a random seed for reproducibility
set.seed(50)

# creation of an object h_clust which stores the output of hclust()
h_clust = hclust(distances, method = 'ward.D')

# plotting the dendrogram to represent Hierarchical Clustering
plot(h_clust, main = paste('Dendrogram'), xlab = 'Customers', ylab = 'Euclidean distances')
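Besides reading the dendrogram visually, the merge heights stored in h_clust$height can help choose the number of clusters: a large jump between consecutive heights suggests the merges beyond it are joining genuinely distinct groups. A sketch on hypothetical toy data (the same idea applies to h_clust from the Mall dataset):

```r
# Hypothetical toy data with two well-separated groups (10 points each)
set.seed(50)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# The last few merge heights; with two well-separated groups,
# the final merge height is typically much larger than the rest
print(tail(hc$height, 5))

# Cutting the tree into 2 clusters recovers the two groups
print(table(cutree(hc, 2)))
```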

STEP 6: Data Visualisation using a scatter plot with a specified number of clusters

We use the cutree() function (from base R's stats package) to specify the number of clusters to be formed. This function cuts the dendrogram in such a way that only the specified number of clusters is obtained. In our case, we will use 5 clusters.

# y is a vector of integers that indicates the cluster in which each observation lies
y = cutree(h_clust, 5)
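Since cutree() returns a plain integer vector of cluster labels, base R's table() summarises how many observations fall into each cluster. A sketch with a hypothetical label vector of the same form:

```r
# Hypothetical cluster labels of the kind cutree() returns
y_example <- c(1, 2, 1, 3, 2, 2, 1, 3)

# Number of observations in each cluster
print(table(y_example))

# Indices of the observations assigned to cluster 2
print(which(y_example == 2))
```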

We use clusplot() from the cluster library to plot the clusters in a scatter plot w.r.t. Age and Annual Income.

# using the clusplot() function with various arguments to plot the clusters
clusplot(customer_prep, y,
         shade = TRUE,
         color = TRUE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Age',
         ylab = 'Annual Income')

This plot helps us analyse the different clusters of customers formed so that we can target each cluster separately in our marketing strategy.
