How to do Affinity based Clustering in R?

This recipe helps you do Affinity based Clustering in R

Recipe Objective

People prefer organised, logical groups of information over unorganised data. For example, anyone finds it easier to remember information when it is clustered together by its common characteristics.

Likewise, Clustering is a machine learning technique that provides a way to find groups/clusters of similar observations within a dataset. Because there is no response variable, it is considered an unsupervised method. This implies that the relationships between 'n' observations are found without the algorithm being trained on a response variable. A few applications of Clustering analysis are: ​

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

Some of the most commonly used Clustering algorithms are: ​

  1. KMeans Clustering: commonly used when we have a large dataset
  2. Hierarchical or Agglomerative Clustering: commonly used when we have a small dataset
  3. Density-based Clustering (DBSCAN)
  4. Affinity Propagation

Affinity propagation is a clustering algorithm developed by Frey and Dueck that identifies exemplars among data points and forms clusters of data points around these exemplars. One of the drawbacks of KMeans is that it is sensitive to the initial random selection of exemplars. Affinity propagation overcomes this problem, and we do not need to specify the number of clusters in advance: it computes the optimal number of clusters for us. ​
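As a quick illustration of this behaviour, the sketch below runs affinity propagation on two synthetic point clouds (assuming the apcluster package is installed); notice that the number of clusters is discovered, not specified:

```r
# A minimal sketch of affinity propagation on synthetic data
library(apcluster)

set.seed(42)
# two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

# negative squared distances as similarity, as in Frey and Dueck's papers
ap <- apcluster(negDistMat(r = 2), x)

length(ap@clusters)   # number of clusters found automatically
ap@exemplars          # indices of the exemplar points
```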

This recipe demonstrates affinity-based clustering on a real-life Mall dataset to carry out customer segmentation in R. ​

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For Clustering algorithm
library(cluster)

install.packages("apcluster")
library(apcluster)

# for cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It contains basic data about customers visiting a supermarket mall and can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. the measured attributes: ​

  1. CustomerID is the customer identification number.
  2. Gender is Female or Male.
  3. Age is the age of the customer.
  4. Annual Income (k$) is the annual income of the customer in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping centre according to the customer's purchasing behaviour.

# creating a dataframe customer_seg
customer_seg = read.csv('R_350_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

For the simplicity of demonstrating affinity-based clustering with visualisation, we will only consider two measured attributes (Age and Annual Income). ​

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. In this step, the data must be scaled or standardised so that the different attributes become comparable. Standardised data has a mean of zero and a standard deviation of one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.
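To see concretely what scale() does, this short sketch standardises a small vector by hand and checks the result against scale():

```r
# scale() subtracts each column's mean and divides by its standard deviation
x <- c(19, 21, 20, 23, 31)

manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))

all.equal(manual, auto)   # the two results agree
round(mean(manual), 10)   # mean is (numerically) zero
sd(manual)                # standard deviation is one
```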

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498

STEP 4: Performing Affinity based clustering

We use the apcluster(s = , x = ) function to carry out this task.

  1. s = a similarity matrix (or a function that computes one) for the input data. The choice negDistMat(r=2), i.e. negative squared distances, is the standard similarity measure used in the papers of Frey and Dueck.
  2. x = the input data
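To make the similarity measure concrete, the sketch below evaluates negDistMat(r = 2) on three hand-picked points; when negDistMat() is called with data (rather than without), it returns the similarity matrix directly:

```r
library(apcluster)

# three 2-D points: (0,0), (3,4) and (0,1)
pts <- matrix(c(0, 0,
                3, 4,
                0, 1), ncol = 2, byrow = TRUE)

# negative squared Euclidean distances as similarities
s <- negDistMat(pts, r = 2)
s
# e.g. the distance between (0,0) and (3,4) is 5,
# so their similarity is -(5^2) = -25
```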

# running affinity propagation clustering
a = apcluster(negDistMat(r=2), x=customer_prep)

# printing the clustering result
a

APResult object

Number of samples     =  200 
Number of iterations  =  138 
Input preference      =  -2.981604 
Sum of similarities   =  -28.08527 
Sum of preferences    =  -38.76086 
Net similarity        =  -66.84613 
Number of clusters    =  13 

Exemplars:
   16 21 31 45 49 83 90 106 124 164 175 191 196
Clusters:
   Cluster 1, exemplar 16:
      1 2 3 4 6 8 14 16 18 22 30 32 34 36
   Cluster 2, exemplar 21:
      5 7 10 12 15 17 20 21 24 26 28 29 39
   Cluster 3, exemplar 31:
      9 11 13 19 25 31 41 54
   Cluster 4, exemplar 45:
      23 27 33 35 37 43 45 47 51 55 56 57 60 64 67
   Cluster 5, exemplar 49:
      38 40 42 44 46 48 49 50 52 53 59 70
   Cluster 6, exemplar 83:
      58 61 63 65 68 71 73 74 75 83 91 103 107 109 110 111 117
   Cluster 7, exemplar 90:
      72 77 80 81 84 86 87 90 93 97 99 102 105 108 118 119 120 129 131
   Cluster 8, exemplar 106:
      62 66 69 76 79 85 88 92 96 98 100 101 104 106 112 114 115 116 121 125 133 
      135 139 163
   Cluster 9, exemplar 124:
      78 82 89 94 95 113 122 123 124 127 128 130 132 137 140 151 152 153 154 
      157 167
   Cluster 10, exemplar 164:
      126 134 136 138 142 143 144 145 146 148 149 150 156 158 159 160 162 164 
      166 168 169 170 171 172 173 174 176 178
   Cluster 11, exemplar 175:
      141 147 155 161 165 175 177 179 183 187
   Cluster 12, exemplar 191:
      180 181 182 184 185 186 188 189 190 191 192
   Cluster 13, exemplar 196:
      193 194 195 196 197 198 199 200

# optimal number of clusters
cat("optimal number of clusters:", length(a@clusters), "\n")

optimal number of clusters: 13 
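The a@clusters slot is a list of index vectors, one per cluster. If a flat per-observation label vector is needed (for example, to attach back to the original dataframe), it can be built as in this self-contained sketch on synthetic data:

```r
library(apcluster)

set.seed(1)
x <- rbind(matrix(rnorm(30), ncol = 2),
           matrix(rnorm(30, mean = 4), ncol = 2))
ap <- apcluster(negDistMat(r = 2), x)

# flatten the list of clusters into one label per observation
cluster_id <- rep(NA_integer_, nrow(x))
for (k in seq_along(ap@clusters)) {
  cluster_id[ap@clusters[[k]]] <- k
}

table(cluster_id)   # cluster sizes
```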

STEP 5: Cluster Visualization

# cluster visualisation
plot(a, customer_prep)

This plot helps us analyse the different clusters of customers formed, so that we can target the respective clusters separately in our marketing strategy.
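If 13 clusters is too fine-grained for a marketing campaign, the apcluster package also provides aggExCluster(), which merges affinity-propagation clusters agglomeratively and lets us cut the resulting hierarchy at a chosen number of groups. The sketch below is a hedged example on synthetic data, assuming the package's aggExCluster()/cutree() interface:

```r
library(apcluster)

set.seed(7)
pts <- rbind(matrix(rnorm(30), ncol = 2),
             matrix(rnorm(30, mean = 4), ncol = 2))

ap <- apcluster(negDistMat(r = 2), pts)

# merge the affinity-propagation clusters agglomeratively
agg <- aggExCluster(s = negDistMat(pts, r = 2), x = ap)

# cut the hierarchy at a chosen number of clusters
few <- cutree(agg, k = 2)
length(few@clusters)
```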
