# How to determine optimal clusters for K-means using silhouette distance in R?

This recipe helps you determine the optimal number of clusters for K-means using silhouette distance in R.

People prefer organised, logical groups of information over unorganised data. For example, anyone finds it easier to remember information when it is clustered together by its common characteristics.

Likewise, clustering is a machine learning technique that finds groups/clusters of similar observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships between the 'n' observations are found without training against a response variable. A few applications of clustering analysis are:

- Customer segmentation: process for dividing customers into groups based on similar characteristics.
- Stock Market Clustering based on the performance of the stocks
- Reducing Dimensionality

There are two commonly used clustering algorithms:

- KMeans Clustering: commonly used when we have a large dataset
- Hierarchical Clustering: commonly used when we have a small dataset

Of the two, KMeans Clustering is the simpler technique; it aims to split the dataset into K groups/clusters and is relatively fast compared to hierarchical clustering. It works on the following algorithm:

- Randomly select k points, known as centroids. These centroids should be as far from each other as possible, and their placement critically impacts the results.
- Calculate the Euclidean distance between each centroid and each point in the data space.
- Assign each point to its nearest centroid. Once assigned, these can be considered early groups.
- Recalculate the centroids (also known as recalibrated centres) based on the points within each group.
- Repeat steps 2, 3 and 4 until the centroids no longer move. This is when the algorithm stops.
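The steps above can be sketched directly in R. This is an illustrative toy implementation, not the built-in kmeans(); the function name simple_kmeans and the toy data are made up for this sketch, and it assumes no cluster becomes empty during the iterations.

```
# A minimal sketch of the K-means loop described above (illustrative only;
# in practice use the built-in kmeans() function)
simple_kmeans <- function(x, k, max_iter = 100) {
  set.seed(1)
  # Step 1: random selection of k points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Step 2: Euclidean distance between every point and every centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    # Step 3: assign each point to its nearest centroid
    cluster <- apply(d, 1, which.min)
    # Step 4: recalculate the centroids from the points in each group
    new_centroids <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # Stop when the centroids can't move any further
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Toy example: two well-separated groups of 10 points each
set.seed(42)
pts <- rbind(cbind(rnorm(10, 0), rnorm(10, 0)),
             cbind(rnorm(10, 10), rnorm(10, 10)))
res <- simple_kmeans(pts, 2)
table(res$cluster)
```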

This recipe demonstrates K-means clustering using a real-life mall dataset to carry out customer segmentation in R.

```
# For Data Manipulation
library(tidyverse)
# For Clustering algorithm
library(cluster)
```

Dataset description: It is basic data about customers visiting a supermarket mall, suitable for customer segmentation. There are 200 observations (customers) and no missing data.

It consists of five columns, i.e. measured attributes:

- CustomerID is the customer identification number.
- Gender is Female and Male.
- Age is the age of customers.
- Annual Income (k) is the annual income of clients in thousands of dollars.
- Spending Score (1-100) is the spending score assigned by the shopping center according to the customer's purchasing behavior

```
# creating a dataframe customer_seg
customer_seg = read.csv('R_246_Mall_Customers.csv')
# getting the required information about the dataset
glimpse(customer_seg)
```

```
Observations: 200
Variables: 5
$ CustomerID             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                 Male, Male, Female, Female, Female, Female, ...
$ Age                    19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..     15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100. 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...
```

For simplicity of demonstrating K-means clustering with visualisation, we will only consider two measured attributes (Age and Annual Income).

```
# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]
```

This is a pre-modelling step. In this step, the data must be scaled (standardised) so that different attributes become comparable; standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.

```
# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()
```

```
        Age Annual.Income..k..
 -1.4210029          -1.734646
 -1.2778288          -1.734646
 -1.3494159          -1.696572
 -1.1346547          -1.696572
 -0.5619583          -1.658498
 -1.2062418          -1.658498
```
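As a quick sanity check (illustrative, on a made-up matrix rather than the mall data), we can confirm that scale() produces columns with mean zero and standard deviation one:

```
# A made-up 5 x 2 matrix standing in for the raw data
m <- matrix(c(19, 21, 20, 23, 31,
              15, 15, 16, 16, 17), ncol = 2)
scaled <- scale(m)
round(colMeans(scaled), 10)   # each column mean is ~0
apply(scaled, 2, sd)          # each column sd is 1
```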

Before we carry out the clustering, we need to find the optimal number of clusters for the above dataset. There are three methods to do so:

- Elbow method
- Silhouette method
- Gap Statistic

We will use the Silhouette method, which relies on silhouette scores. The average silhouette score measures the quality of a clustering: the higher the score, the better the clustering. We will use the fviz_nbclust(x, FUNcluster, method = ) function from the factoextra package, where:

- x = dataframe
- FUNcluster = a partitioning function, e.g. kmeans or hcut (for hierarchical clustering)
- method = the method to be used to determine the optimal number of clusters

```
factoextra::fviz_nbclust(customer_prep, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
```

Note: According to the average silhouette, the optimal number of clusters is 3.
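If you prefer to compute the curve yourself, the average silhouette width behind fviz_nbclust() can also be obtained directly with cluster::silhouette(). The sketch below uses a small synthetic dataset with three planted groups in place of customer_prep:

```
library(cluster)

# Synthetic stand-in for the scaled customer data: three planted groups
set.seed(50)
demo <- rbind(cbind(rnorm(20, 0), rnorm(20, 0)),
              cbind(rnorm(20, 5), rnorm(20, 5)),
              cbind(rnorm(20, 0), rnorm(20, 5)))

# Average silhouette width for k = 2..6
avg_sil <- sapply(2:6, function(k) {
  km_k <- kmeans(demo, centers = k, nstart = 25)
  mean(silhouette(km_k$cluster, dist(demo))[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]
best_k  # should recover the three planted groups
```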

We will use the kmeans() function (from base R's stats package) to perform this. The two arguments used below are:

- x = dataset being used (mandatory input)
- centers = number of clusters (k) (mandatory input). We will use 3 in this case.

```
# setting the seed for reproducible random centroid initialisation
set.seed(50)
# creating an object km which stores the output of the kmeans() function
km = kmeans(x = customer_prep, centers = 3)
km
```

```
K-means clustering with 3 clusters of sizes 76, 62, 62

Cluster means:
         Age Annual.Income..k..
1 -0.2784359          0.9660948
2  1.2138623         -0.3553890
3 -0.8725537         -0.8288562

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 2 3 2 3 2 3 3 3 3 3 2 3 3 3 2 3 2 3 2 3 3 3 2 3 2 3 2 3 3
 [38] 3 3 3 2 3 2 3 2 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2
 [75] 2 3 2 3 3 2 2 3 2 2 3 2 2 3 3 2 2 3 2 1 3 3 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2
[112] 3 1 3 3 3 2 2 2 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 51.66597 42.19537 38.32968
 (between_SS / total_SS =  66.8 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
```

The most important components to note from the output of kmeans() function are:

- cluster (access: km$cluster): the vector of integers indicating the assignment of each observation to a particular cluster
- totss (access: km$totss): returns the total sum of squares
- centers (access: km$centers): returns the matrix of cluster centers
- withinss (access: km$withinss): returns the within-cluster sum of squares (one value per cluster)
- tot.withinss (access: km$tot.withinss): returns the total within-cluster sum of squares
- size (access: km$size): returns the number of points in each cluster.
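As a hedged illustration of accessing these components, the sketch below uses a small made-up dataset in place of customer_prep. It also checks the standard identity that the total sum of squares decomposes into the between-cluster and total within-cluster parts:

```
# Small synthetic dataset standing in for customer_prep
set.seed(50)
demo <- rbind(cbind(rnorm(20, 0), rnorm(20, 0)),
              cbind(rnorm(20, 5), rnorm(20, 5)))
km_demo <- kmeans(demo, centers = 2)

km_demo$cluster   # cluster assignment of each observation
km_demo$centers   # matrix of cluster centers
km_demo$size      # number of points per cluster
km_demo$totss     # total sum of squares
# totss always decomposes into between- and within-cluster parts:
km_demo$betweenss + km_demo$tot.withinss
```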

We use the clusplot() function from the cluster library to plot the clusters formed with respect to Age and Annual Income.

```
# contains the vector of integers indicating the assignment of each observation to a particular cluster
k_means = km$cluster
# using clusplot() function with various arguments to plot the clusters
clusplot(customer_prep, k_means, shade = TRUE, color = TRUE, span = TRUE,
main = paste('Clusters of customers'),
xlab = 'Age',
ylab = 'Annual Income')
```

This plot helps us analyse the different clusters of customers formed, so that we can target each cluster separately in our marketing strategy.
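A natural follow-up is assigning a new (already scaled) customer to one of the existing clusters. Since kmeans() has no built-in predict() method, a common sketch is to pick the nearest centroid; the data and the new point below are made up for illustration:

```
# Made-up scaled data: two groups, then a new point to classify
set.seed(50)
demo <- rbind(cbind(rnorm(10, 0), rnorm(10, 0)),
              cbind(rnorm(10, 5), rnorm(10, 5)))
km_demo <- kmeans(demo, centers = 2)

new_point <- c(5, 5)
# Euclidean distance from the new point to each existing centroid
d <- apply(km_demo$centers, 1, function(ctr) sqrt(sum((ctr - new_point)^2)))
nearest <- which.min(d)  # cluster label of the nearest centroid
nearest
```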
