How to do Affinity based Clustering in R?

This recipe helps you do Affinity based Clustering in R

Recipe Objective

People prefer organised, logical groups of information over unorganised data. For example, anyone finds it easier to remember information when it is clustered together by its common characteristics.

Likewise, Clustering is a machine learning technique that provides a way to find groups/clusters of similar observations within a dataset. Because there is no response variable, it is considered an unsupervised method. This implies that the relationships between 'n' observations are found without the algorithm being trained on a response variable. A few applications of Clustering analysis are: ​

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

Some of the most commonly used Clustering algorithms are: ​

  1. KMeans Clustering: commonly used when we have a large dataset
  2. Hierarchical or Agglomerative Clustering: commonly used when we have a small dataset
  3. Density-based Clustering (DBSCAN)
  4. Affinity Propagation

Affinity propagation is a clustering algorithm developed by Frey and Dueck that identifies exemplars among data points and forms clusters of data points around these exemplars. One of the drawbacks of KMeans is that it is sensitive to the initial random selection of exemplars. Affinity propagation overcomes this problem, and we do not need to specify the number of clusters in advance: it computes the optimal number of clusters for us. ​
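As a quick illustration of this behaviour, the sketch below runs affinity propagation on two synthetic point clouds (assuming the apcluster package is installed); notice that the number of clusters is discovered, not specified:

```r
# A minimal sketch of affinity propagation on synthetic data
library(apcluster)

set.seed(42)
# two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

# negative squared distances as similarity, as in Frey and Dueck's papers
ap <- apcluster(negDistMat(r = 2), x)

length(ap@clusters)   # number of clusters found automatically
ap@exemplars          # indices of the exemplar points
```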

This recipe demonstrates affinity-based clustering on a real-life Mall dataset to carry out customer segmentation in R. ​

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For Clustering algorithm
library(cluster)

install.packages("apcluster")
library(apcluster)

# for cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It contains basic data about customers visiting a supermarket mall and can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. the measured attributes: ​

  1. CustomerID is the customer identification number.
  2. Gender is Female or Male.
  3. Age is the age of the customer.
  4. Annual Income (k$) is the annual income of the customer in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping centre according to the customer's purchasing behaviour.

# creating a dataframe customer_seg
customer_seg = read.csv('R_350_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

For the simplicity of demonstrating affinity-based clustering with visualisation, we will only consider two measured attributes (Age and Annual Income). ​

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. In this step, the data must be scaled or standardised so that the different attributes become comparable. Standardised data has a mean of zero and a standard deviation of one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.
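To see concretely what scale() does, this short sketch standardises a small vector by hand and checks the result against scale():

```r
# scale() subtracts each column's mean and divides by its standard deviation
x <- c(19, 21, 20, 23, 31)

manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))

all.equal(manual, auto)   # the two results agree
round(mean(manual), 10)   # mean is (numerically) zero
sd(manual)                # standard deviation is one
```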

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498

STEP 4: Performing Affinity based clustering

We use the apcluster(s = , x = ) function to carry out this task.

  1. s = a similarity matrix (or a function that computes one) for the input data. The choice negDistMat(r=2), i.e. negative squared distances, is the standard similarity measure used in the papers of Frey and Dueck.
  2. x = the input data
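To make the similarity measure concrete, the sketch below evaluates negDistMat(r = 2) on three hand-picked points; when negDistMat() is called with data (rather than without), it returns the similarity matrix directly:

```r
library(apcluster)

# three 2-D points: (0,0), (3,4) and (0,1)
pts <- matrix(c(0, 0,
                3, 4,
                0, 1), ncol = 2, byrow = TRUE)

# negative squared Euclidean distances as similarities
s <- negDistMat(pts, r = 2)
s
# e.g. the distance between (0,0) and (3,4) is 5,
# so their similarity is -(5^2) = -25
```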

# running affinity propagation clustering
a = apcluster(negDistMat(r=2), x=customer_prep)

# printing the clustering result
a

APResult object

Number of samples     =  200 
Number of iterations  =  138 
Input preference      =  -2.981604 
Sum of similarities   =  -28.08527 
Sum of preferences    =  -38.76086 
Net similarity        =  -66.84613 
Number of clusters    =  13 

Exemplars:
   16 21 31 45 49 83 90 106 124 164 175 191 196
Clusters:
   Cluster 1, exemplar 16:
      1 2 3 4 6 8 14 16 18 22 30 32 34 36
   Cluster 2, exemplar 21:
      5 7 10 12 15 17 20 21 24 26 28 29 39
   Cluster 3, exemplar 31:
      9 11 13 19 25 31 41 54
   Cluster 4, exemplar 45:
      23 27 33 35 37 43 45 47 51 55 56 57 60 64 67
   Cluster 5, exemplar 49:
      38 40 42 44 46 48 49 50 52 53 59 70
   Cluster 6, exemplar 83:
      58 61 63 65 68 71 73 74 75 83 91 103 107 109 110 111 117
   Cluster 7, exemplar 90:
      72 77 80 81 84 86 87 90 93 97 99 102 105 108 118 119 120 129 131
   Cluster 8, exemplar 106:
      62 66 69 76 79 85 88 92 96 98 100 101 104 106 112 114 115 116 121 125 133 
      135 139 163
   Cluster 9, exemplar 124:
      78 82 89 94 95 113 122 123 124 127 128 130 132 137 140 151 152 153 154 
      157 167
   Cluster 10, exemplar 164:
      126 134 136 138 142 143 144 145 146 148 149 150 156 158 159 160 162 164 
      166 168 169 170 171 172 173 174 176 178
   Cluster 11, exemplar 175:
      141 147 155 161 165 175 177 179 183 187
   Cluster 12, exemplar 191:
      180 181 182 184 185 186 188 189 190 191 192
   Cluster 13, exemplar 196:
      193 194 195 196 197 198 199 200

# optimal number of clusters
cat("optimal number of clusters:", length(a@clusters), "\n")

optimal number of clusters: 13 
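The a@clusters slot is a list of index vectors, one per cluster. If a flat per-observation label vector is needed (for example, to attach back to the original dataframe), it can be built as in this self-contained sketch on synthetic data:

```r
library(apcluster)

set.seed(1)
x <- rbind(matrix(rnorm(30), ncol = 2),
           matrix(rnorm(30, mean = 4), ncol = 2))
ap <- apcluster(negDistMat(r = 2), x)

# flatten the list of clusters into one label per observation
cluster_id <- rep(NA_integer_, nrow(x))
for (k in seq_along(ap@clusters)) {
  cluster_id[ap@clusters[[k]]] <- k
}

table(cluster_id)   # cluster sizes
```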

STEP 5: Cluster Visualization

# cluster visualisation
plot(a, customer_prep)

This plot helps us analyse the different clusters of customers formed, so that we can target the respective clusters separately in our marketing strategy.
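If 13 clusters is too fine-grained for a marketing campaign, the apcluster package also provides aggExCluster(), which merges affinity-propagation clusters agglomeratively and lets us cut the resulting hierarchy at a chosen number of groups. The sketch below is a hedged example on synthetic data, assuming the package's aggExCluster()/cutree() interface:

```r
library(apcluster)

set.seed(7)
pts <- rbind(matrix(rnorm(30), ncol = 2),
             matrix(rnorm(30, mean = 4), ncol = 2))

ap <- apcluster(negDistMat(r = 2), pts)

# merge the affinity-propagation clusters agglomeratively
agg <- aggExCluster(s = negDistMat(pts, r = 2), x = ap)

# cut the hierarchy at a chosen number of clusters
few <- cutree(agg, k = 2)
length(few@clusters)
```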
