How to do DBSCAN clustering in R?

This recipe helps you do DBSCAN clustering in R

Recipe Objective

People prefer organised, logically grouped information over unorganised data. For example, anyone finds it easier to remember information when it is grouped by its common characteristics.

Likewise, Clustering is a machine learning technique that finds groups/clusters of observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships between the 'n' observations are found without guidance from a response variable. A few applications of cluster analysis are:

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

The most commonly used clustering algorithms are:

  1. KMeans Clustering: commonly used on large datasets
  2. Hierarchical or Agglomerative Clustering: commonly used on small datasets
  3. Density-based clustering (DBSCAN)
  4. Affinity Propagation

Density-based clustering was first introduced in the following 1996 paper: Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.

It is a partitioning method that groups objects based on densely populated areas. This lets it find clusters of varied shapes and sizes while remaining robust to noise and outliers in the data. KMeans clustering, on the other hand, mainly finds circular (spherical) clusters. One key parameter in DBSCAN is the eps value, the radius of the neighbourhood considered around each point.
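A quick, self-contained illustration of why this matters (synthetic toy data, not the Mall dataset used below): two concentric rings that k-means would cut straight through, but that DBSCAN separates because it follows density rather than distance to a centroid.

```r
library(dbscan)

# Synthetic data: two concentric rings (radius 1 and radius 3)
set.seed(1)
theta <- runif(500, 0, 2 * pi)
radius <- rep(c(1, 3), each = 250)
rings <- cbind(radius * cos(theta), radius * sin(theta)) +
  matrix(rnorm(1000, sd = 0.05), ncol = 2)

# eps = 0.8 is smaller than the 2-unit gap between the rings,
# so DBSCAN keeps them apart instead of merging or bisecting them
db <- dbscan::dbscan(rings, eps = 0.8, minPts = 5)
table(db$cluster)  # one count per cluster (label 0, if present, is noise)
```

The eps and minPts values here are hand-picked for this toy example; the steps below show how to choose eps for real data.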

This recipe demonstrates density-based clustering on a real-life Mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For Clustering algorithms
library(cluster)
install.packages("fpc")
library(fpc)
install.packages("dbscan")
library(dbscan)

# For cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It is basic data about the customers visiting a supermarket mall, which can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. measured attributes:

  1. CustomerID is the customer identification number.
  2. Gender is Female and Male.
  3. Age is the age of customers.
  4. Annual Income (k) is the annual income of clients in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping centre according to the customer's purchasing behaviour.

# creating a dataframe customer_seg
customer_seg = read.csv('R_292_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

To keep the demonstration and its visualisation simple, we will only consider two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. The data must be scaled or standardised so that different attributes are comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
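As a quick sanity check that scale() does what the text claims, here is a minimal sketch on a small hand-made matrix (not the Mall data): after scaling, each column has mean zero and standard deviation one.

```r
# Sanity check: scale() gives every column mean 0 and standard deviation 1
x <- matrix(c(19, 21, 20, 23, 31, 22,
              15, 15, 16, 16, 17, 17), ncol = 2)
x_scaled <- scale(x)

round(colMeans(x_scaled), 10)      # 0 for both columns
round(apply(x_scaled, 2, sd), 10)  # 1 for both columns
```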

STEP 4: Obtaining the Optimal Value of eps

We use the kNNdistplot(data, k) function from the dbscan package for this task. It plots, for each point, the distance to its k-th nearest neighbour; the "knee" (sharp bend) in this curve suggests a suitable eps value.

# to plot the kNN distances
kNNdistplot(customer_prep, k = 3)

# to draw a line at the chosen eps value (at the knee of the curve)
abline(h = 0.45, lty = 2)

STEP 5: Performing DBSCAN on the Dataset

We will use the dbscan::dbscan() function from the dbscan package in R to perform this. The three arguments used below are:

  1. data
  2. eps value
  3. minimum number of points within the eps

# setting a random seed for reproducibility
set.seed(50)

# creating an object d which stores the output of the dbscan function
d <- dbscan::dbscan(customer_prep, eps = 0.45, minPts = 2)
d

DBSCAN clustering for 200 objects.
Parameters: eps = 0.45, minPts = 2
The clustering contains 2 cluster(s) and 1 noise points.

  0   1   2 
  1 197   2 

Available fields: cluster, eps, minPts

Note: cluster label 0 indicates noise points, i.e. points that do not belong to any cluster.
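The noise label can be used to pull the outlying rows back out of the original data. A minimal self-contained sketch on toy points (not the Mall dataset) follows; on the real data, the equivalent filter would be customer_seg[d$cluster == 0, ].

```r
library(dbscan)

# Three tight points plus one far-away outlier
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(10, 10))

fit <- dbscan::dbscan(pts, eps = 0.5, minPts = 2)
fit$cluster  # c(1, 1, 1, 0): the last point is labelled as noise

# extracting the noise rows from the data
pts[fit$cluster == 0, , drop = FALSE]
```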

STEP 6: Cluster Visualization

# cluster visualisation
fviz_cluster(d, customer_prep, geom = "point")

