How to do DBSCAN clustering in R?

This recipe helps you do DBSCAN clustering in R

Recipe Objective

People prefer information organised into logical groups over unorganised data. For example, anyone finds it easier to remember information when it is clustered together by its common characteristics.

Likewise, Clustering is a machine learning technique that finds groups/clusters of observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships among the 'n' observations are found without the model being trained on a response variable. A few applications of cluster analysis are:

  1. Customer segmentation: dividing customers into groups based on similar characteristics.
  2. Stock market clustering based on the performance of the stocks.
  3. Dimensionality reduction.

Some of the most commonly used clustering algorithms are:

  1. KMeans clustering: commonly used when we have a large dataset
  2. Hierarchical or agglomerative clustering: commonly used when we have a small dataset
  3. Density-based clustering (DBSCAN)
  4. Affinity propagation

Density-based clustering was first introduced in the following 1996 paper: Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–31. AAAI Press.

It is a partitioning method that groups objects based on densely populated areas. This lets us find clusters of varied shapes and sizes while remaining robust to noise and outliers in the data. KMeans clustering, on the other hand, mainly finds roughly circular clusters. One of the parameters we use in DBSCAN is the eps value, the radius of the neighbourhood around each point that is used to assess local density.
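To see the difference in practice, here is a minimal sketch contrasting the two methods on non-spherical data. It assumes the multishapes toy dataset that ships with the factoextra package; the eps and minPts values are illustrative choices for that data, not universal defaults.

# A quick illustration (assumes the multishapes dataset from factoextra)
library(factoextra)
library(dbscan)
data("multishapes", package = "factoextra")
shapes <- multishapes[, 1:2]

# k-means forces roughly circular clusters and splits the rings
km <- kmeans(shapes, centers = 5, nstart = 25)

# DBSCAN recovers the ring and line shapes, labelling noise as cluster 0
# (eps = 0.15 and minPts = 5 are reasonable values for this toy data)
db <- dbscan::dbscan(shapes, eps = 0.15, minPts = 5)
table(db$cluster)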

This recipe demonstrates density-based clustering on a real-life mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For data manipulation
library(tidyverse)

# For clustering algorithms
library(cluster)
install.packages("fpc")
library(fpc)
install.packages("dbscan")
library(dbscan)

# For cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It is basic data about the customers of a supermarket mall, which can be used for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. measured attributes:

  1. CustomerID is the customer identification number.
  2. Gender is Female or Male.
  3. Age is the age of the customer.
  4. Annual Income (k$) is the annual income of the customer in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping centre according to the customer's purchasing behaviour.

# creating a dataframe customer_seg
customer_seg = read.csv('R_292_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

For simplicity of demonstrating density-based clustering with visualisation, we will only consider two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. The data must be scaled or standardised so that different attributes are comparable. Standardised data has a mean of zero and a standard deviation of one. We do this using the scale() function.

Note: Scaling is an important pre-modelling step and should be treated as mandatory for distance-based methods such as DBSCAN.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
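Before modelling, it is worth confirming that the standardisation behaved as expected. A quick sanity check using only base R functions:

# each column should now have mean ~ 0 and standard deviation ~ 1
round(colMeans(customer_prep), 10)
apply(customer_prep, 2, sd)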

STEP 4: Obtaining the Optimal Value of eps

We use the kNNdistplot(data, k = ) function from the dbscan package to carry out this task. It plots, for every point, the distance to its k-th nearest neighbour, sorted in increasing order; the "knee" (sharp bend) in this curve suggests a suitable eps value.

# to plot the sorted kNN distances used for choosing eps
kNNdistplot(customer_prep, k = 3)

# to draw a line at the chosen optimum (the knee of the curve)
abline(h = 0.45, lty = 2)

STEP 5: Performing DBSCAN on the Dataset

We will use the dbscan::dbscan() function from the dbscan package in R to perform this. The three arguments used below are:

  1. data
  2. eps value
  3. minPts, the minimum number of points required within the eps radius

# setting a random seed for reproducibility
set.seed(50)

# creating an object d which stores the output of the dbscan function
d <- dbscan::dbscan(customer_prep, eps = 0.45, minPts = 2)
d

DBSCAN clustering for 200 objects.
Parameters: eps = 0.45, minPts = 2
The clustering contains 2 cluster(s) and 1 noise points.

  0   1   2 
  1 197   2 

Available fields: cluster, eps, minPts

Note: cluster 0 contains the noise points, i.e. observations that do not belong to any cluster.
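Because d$cluster holds one label per observation, in the same row order as customer_prep, the noise points and cluster assignments can be pulled back into the original data. A short sketch:

# extracting the noise points (cluster 0)
customer_seg[d$cluster == 0, ]

# attaching the cluster labels back onto the original data
customer_seg$cluster <- d$cluster
head(customer_seg)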

STEP 6: Cluster Visualization

# cluster visualisation
fviz_cluster(d, customer_prep, geom = "point")
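If you prefer a base-graphics alternative to factoextra, the dbscan package itself provides a convex-hull plot of the clusters:

# draws the points with a convex hull around each cluster
dbscan::hullplot(customer_prep, d)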

