# How to normalize and standardize data in R?

This recipe helps you normalize and standardize data in R

It is crucial to normalise or standardise the data before building a machine learning model. Many algorithms, particularly those based on distances or gradient descent, tend to be dominated by the variables with larger scales, which hurts the performance of the model. Normalisation and standardisation bring all the numeric variables onto a comparable scale so that no variable dominates purely because of its units. This is a data preprocessing step that is applied only to the independent variables.
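To make the scale problem concrete, here is a minimal sketch using a small made-up data frame (the values are hypothetical and are not taken from the diabetes dataset). It compares Euclidean distances between three rows before and after putting the columns on a comparable scale with base R's `scale()`.

```r
# Hypothetical toy data: two columns measured on very different scales
toy <- data.frame(
  Insulin = c(80, 480, 100),                    # differences of hundreds of units
  DiabetesPedigreeFunction = c(0.2, 0.3, 2.0)   # differences of about two units
)

# Euclidean distances between rows on the raw values
raw_dist <- as.matrix(dist(toy))

# Distances after centring each column and dividing by its standard deviation
scaled_dist <- as.matrix(dist(scale(toy)))

print(round(raw_dist, 3))
print(round(scaled_dist, 3))
```

On the raw values, rows 1 and 3 look almost identical because the distance is driven almost entirely by Insulin, even though their DiabetesPedigreeFunction values differ widely. After scaling, both columns contribute to the distance.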

In this recipe, we will learn how to normalise and standardise the data in R.

Data Description: This dataset consists of several medical predictor variables (also known as the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

```
# reading the dataset into a dataframe named diabetes
diabetes = read.csv('R_242_diabetes.csv')
# printing the statistical summary of the data
summary(diabetes)
```

| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Min. | 0.000 | 0.0 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0780 | 21.00 | 0.000 |
| 1st Qu. | 1.000 | 99.0 | 62.00 | 0.00 | 0.0 | 27.30 | 0.2437 | 24.00 | 0.000 |
| Median | 3.000 | 117.0 | 72.00 | 23.00 | 30.5 | 32.00 | 0.3725 | 29.00 | 0.000 |
| Mean | 3.845 | 120.9 | 69.11 | 20.54 | 79.8 | 31.99 | 0.4719 | 33.24 | 0.349 |
| 3rd Qu. | 6.000 | 140.2 | 80.00 | 32.00 | 127.2 | 36.60 | 0.6262 | 41.00 | 1.000 |
| Max. | 17.000 | 199.0 | 122.00 | 99.00 | 846.0 | 67.10 | 2.4200 | 81.00 | 1.000 |

Note: You can clearly see from the above summary that the ranges of the variables differ significantly, so we need to standardise or normalise the data in this case.

1. Standardization

This technique subtracts the mean from each value of the variable and divides the result by the standard deviation of the variable. The standardised variable therefore has mean = 0 and standard deviation = 1; if the original variable comes from a normal distribution, the standardised values follow the standard normal distribution.

We will use the scale(data_frame) function to carry out this task.

```
# standardising the independent variables
scaled_df = scale(diabetes[,1:8])
summary(scaled_df)
```

| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| Min. | -1.1411 | -3.7812 | -3.5703 | -1.2874 | -0.6924 | -4.057829 | -1.1888 | -1.0409 |
| 1st Qu. | -0.8443 | -0.6848 | -0.3671 | -1.2874 | -0.6924 | -0.595191 | -0.6885 | -0.7858 |
| Median | -0.2508 | -0.1218 | 0.1495 | 0.1544 | -0.4278 | 0.000941 | -0.2999 | -0.3606 |
| Mean | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.000000 | 0.0000 | 0.0000 |
| 3rd Qu. | 0.6395 | 0.6054 | 0.5629 | 0.7186 | 0.4117 | 0.584390 | 0.4659 | 0.6598 |
| Max. | 3.9040 | 2.4429 | 2.7327 | 4.9187 | 6.6485 | 4.452906 | 5.8797 | 4.0611 |

You can see from the above statistical summary that every variable now has a mean of 0 and that the ranges of the numeric variables are comparable, so the data can be used for modelling.
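As a sanity check on the formula described in this section, the following sketch standardises a small vector by hand and compares the result with `scale()`. The vector `x` is hypothetical and serves only as an illustration; `scale()` divides by the sample standard deviation, which is what `sd()` computes.

```r
# Hypothetical values, used only to illustrate the formula
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops its attributes
z_scale <- as.numeric(scale(x))

# TRUE if the manual result matches scale() up to floating-point tolerance
print(all.equal(z_manual, z_scale))

# The standardised values have mean 0 and standard deviation 1 (up to rounding)
print(c(mean = mean(z_manual), sd = sd(z_manual)))
```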

2. Normalization

Normalisation, or min-max scaling, rescales the data to the range 0 to 1 by subtracting the variable's minimum from each value and then dividing by the range (maximum minus minimum).

Note: This preserves the shape of each variable’s distribution and makes it easier for us to compare them.

We will use the normalize() function from the "BBmisc" package in R, with method = "range", to carry out normalisation.

```
# install (needed only once) and load the BBmisc package
install.packages("BBmisc")
library(BBmisc)
# method = range for normalisation
scaled_df_norm = normalize(diabetes[,1:8], method = "range", range = c(0, 1))
summary(scaled_df_norm)
```

| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| Min. | 0.00000 | 0.0000 | 0.0000 | 0.0000 | 0.00000 | 0.0000 | 0.00000 | 0.0000 |
| 1st Qu. | 0.05882 | 0.4975 | 0.5082 | 0.0000 | 0.00000 | 0.4069 | 0.07077 | 0.0500 |
| Median | 0.17647 | 0.5879 | 0.5902 | 0.2323 | 0.03605 | 0.4769 | 0.12575 | 0.1333 |
| Mean | 0.22618 | 0.6075 | 0.5664 | 0.2074 | 0.09433 | 0.4768 | 0.16818 | 0.2040 |
| 3rd Qu. | 0.35294 | 0.7048 | 0.6557 | 0.3232 | 0.15041 | 0.5455 | 0.23409 | 0.3333 |
| Max. | 1.00000 | 1.0000 | 1.0000 | 1.0000 | 1.00000 | 1.0000 | 1.00000 | 1.0000 |

You can see from the above statistical summary that every numeric variable now ranges from 0 to 1, so the data can be used for modelling.
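The min-max formula can also be written in a few lines of base R, without any package. The sketch below is only an illustration: `min_max` is a hypothetical helper (not part of BBmisc), and the toy data frame reuses the minimum, first quartile, median and maximum of Glucose and Age from the first summary above.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)       # rng[1] = minimum, rng[2] = maximum
  (x - rng[1]) / (rng[2] - rng[1])    # a constant column would give NaN here
}

# Toy data built from the min, 1st quartile, median and max reported earlier
toy <- data.frame(
  Glucose = c(0, 99, 117, 199),
  Age     = c(21, 24, 29, 81)
)

# Apply the helper to every column and rebuild a data frame
toy_norm <- as.data.frame(lapply(toy, min_max))
print(round(toy_norm, 4))
```

The rescaled quartiles agree with the normalised summary above: for example, a Glucose value of 99 maps to 99 / 199, about 0.4975, and an Age of 24 maps to (24 - 21) / 60 = 0.05.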
