How to do DBSCAN clustering in R?

This recipe helps you do DBSCAN clustering in R

Recipe Objective

People prefer organised, logically grouped information over unorganised data. For example, anyone finds it easier to remember information when it is grouped by its common characteristics.

Likewise, Clustering is a machine learning technique that finds groups/clusters of observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships between the 'n' observations are found without guidance from a response variable. A few applications of cluster analysis are:

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

The most commonly used clustering algorithms are:

  1. KMeans Clustering: commonly used on large datasets
  2. Hierarchical or Agglomerative Clustering: commonly used on small datasets
  3. Density-based clustering (DBSCAN)
  4. Affinity Propagation

Density-based clustering was first introduced in the following 1996 paper: Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.

It is a partitioning method that groups objects based on densely populated areas. This lets it find clusters of varied shapes and sizes while remaining robust to noise and outliers in the data. KMeans clustering, on the other hand, mainly finds circular (spherical) clusters. One key parameter in DBSCAN is the eps value, the radius of the neighbourhood considered around each point.
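A quick, self-contained illustration of why this matters (synthetic toy data, not the Mall dataset used below): two concentric rings that k-means would cut straight through, but that DBSCAN separates because it follows density rather than distance to a centroid.

```r
library(dbscan)

# Synthetic data: two concentric rings (radius 1 and radius 3)
set.seed(1)
theta <- runif(500, 0, 2 * pi)
radius <- rep(c(1, 3), each = 250)
rings <- cbind(radius * cos(theta), radius * sin(theta)) +
  matrix(rnorm(1000, sd = 0.05), ncol = 2)

# eps = 0.8 is smaller than the 2-unit gap between the rings,
# so DBSCAN keeps them apart instead of merging or bisecting them
db <- dbscan::dbscan(rings, eps = 0.8, minPts = 5)
table(db$cluster)  # one count per cluster (label 0, if present, is noise)
```

The eps and minPts values here are hand-picked for this toy example; the steps below show how to choose eps for real data.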

This recipe demonstrates density-based clustering on a real-life Mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For Clustering algorithms
library(cluster)
install.packages("fpc")
library(fpc)
install.packages("dbscan")
library(dbscan)

# For cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It is basic data about the customers visiting a supermarket mall, which can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. measured attributes:

  1. CustomerID is the customer identification number.
  2. Gender is Female and Male.
  3. Age is the age of customers.
  4. Annual Income (k) is the annual income of clients in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping centre according to the customer's purchasing behaviour.

# creating a dataframe customer_seg
customer_seg = read.csv('R_292_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

To keep the demonstration and its visualisation simple, we will only consider two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. The data must be scaled or standardised so that different attributes are comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
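As a quick sanity check that scale() does what the text claims, here is a minimal sketch on a small hand-made matrix (not the Mall data): after scaling, each column has mean zero and standard deviation one.

```r
# Sanity check: scale() gives every column mean 0 and standard deviation 1
x <- matrix(c(19, 21, 20, 23, 31, 22,
              15, 15, 16, 16, 17, 17), ncol = 2)
x_scaled <- scale(x)

round(colMeans(x_scaled), 10)      # 0 for both columns
round(apply(x_scaled, 2, sd), 10)  # 1 for both columns
```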

STEP 4: Obtaining the Optimal Value of eps

We use the kNNdistplot(data, k) function from the dbscan package for this task. It plots, for each point, the distance to its k-th nearest neighbour; the "knee" (sharp bend) in this curve suggests a suitable eps value.

# to plot the kNN distances
kNNdistplot(customer_prep, k = 3)

# to draw a line at the chosen eps value (at the knee of the curve)
abline(h = 0.45, lty = 2)

STEP 5: Performing DBSCAN on the Dataset

We will use the dbscan::dbscan() function from the dbscan package in R to perform this. The three arguments used below are:

  1. data
  2. eps value
  3. minimum number of points within the eps

# setting a random seed for reproducibility
set.seed(50)

# creating an object d which stores the output of the dbscan function
d <- dbscan::dbscan(customer_prep, eps = 0.45, minPts = 2)
d

DBSCAN clustering for 200 objects.
Parameters: eps = 0.45, minPts = 2
The clustering contains 2 cluster(s) and 1 noise points.

  0   1   2 
  1 197   2 

Available fields: cluster, eps, minPts

Note: cluster label 0 indicates noise points, i.e. points that do not belong to any cluster.
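The noise label can be used to pull the outlying rows back out of the original data. A minimal self-contained sketch on toy points (not the Mall dataset) follows; on the real data, the equivalent filter would be customer_seg[d$cluster == 0, ].

```r
library(dbscan)

# Three tight points plus one far-away outlier
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(10, 10))

fit <- dbscan::dbscan(pts, eps = 0.5, minPts = 2)
fit$cluster  # c(1, 1, 1, 0): the last point is labelled as noise

# extracting the noise rows from the data
pts[fit$cluster == 0, , drop = FALSE]
```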

STEP 6: Cluster Visualization

# cluster visualisation
fviz_cluster(d, customer_prep, geom = "point")

