How to normalize and standardize data in R?

This recipe helps you normalize and standardize data in R
Last Updated: 26 Dec 2022

Recipe Objective

It is crucial to normalise or standardise the data before building a machine learning model. Many machine learning algorithms tend to be dominated by the variables with larger scales, which hurts the performance of the model. Normalisation and standardisation bring all the numeric variables onto a comparable scale so that no single variable dominates. Both are data preprocessing steps applied only to the independent variables.
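
As a rough illustration of the scale effect described above, the sketch below compares Euclidean distances before and after standardising. The data frame `toy`, its column names and its values are made up for this example and are not taken from the recipe's dataset.

```r
# Hypothetical toy data: two variables on very different scales
toy <- data.frame(
  insulin  = c(0, 100, 800),    # large scale
  pedigree = c(0.1, 2.4, 0.2)   # small scale
)

# Euclidean distances on the raw data
raw_dist <- as.matrix(dist(toy))

# Euclidean distances after standardising each column (mean 0, sd 1)
scaled_dist <- as.matrix(dist(scale(toy)))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

On the raw data the distances are almost identical to the differences in `insulin` alone, so `pedigree` contributes next to nothing. After standardising, both columns contribute to the distance.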

In this recipe, we will learn how to normalise and standardise the data in R.

STEP 1: Read the dataset

Data Description: This dataset consists of several medical predictor variables (also known as the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

# reading the dataset into a data frame
diabetes = read.csv('R_242_diabetes.csv')

# printing the statistical summary of the data
summary(diabetes)

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000

Note: The summary above shows that the ranges of the variables differ significantly, so we need to standardise or normalise them.

STEP 2: Standardization and Normalization

1. Standardization

This technique subtracts the mean from each value of a variable and divides the result by the variable's standard deviation. Every standardised variable then has mean = 0 and standard deviation = 1. If a variable comes from a normal distribution, its standardised values follow the standard normal distribution.

We will use the scale(data_frame) function to carry out this task.

# standardising the independent variables
scaled_df = scale(diabetes[,1:8])
summary(scaled_df)

Pregnancies         Glucose        BloodPressure     SkinThickness    
 Min.   :-1.1411   Min.   :-3.7812   Min.   :-3.5703   Min.   :-1.2874  
 1st Qu.:-0.8443   1st Qu.:-0.6848   1st Qu.:-0.3671   1st Qu.:-1.2874  
 Median :-0.2508   Median :-0.1218   Median : 0.1495   Median : 0.1544  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6395   3rd Qu.: 0.6054   3rd Qu.: 0.5629   3rd Qu.: 0.7186  
 Max.   : 3.9040   Max.   : 2.4429   Max.   : 2.7327   Max.   : 4.9187  
    Insulin             BMI            DiabetesPedigreeFunction
 Min.   :-0.6924   Min.   :-4.057829   Min.   :-1.1888         
 1st Qu.:-0.6924   1st Qu.:-0.595191   1st Qu.:-0.6885         
 Median :-0.4278   Median : 0.000941   Median :-0.2999         
 Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000         
 3rd Qu.: 0.4117   3rd Qu.: 0.584390   3rd Qu.: 0.4659         
 Max.   : 6.6485   Max.   : 4.452906   Max.   : 5.8797         
      Age         
 Min.   :-1.0409  
 1st Qu.:-0.7858  
 Median :-0.3606  
 Mean   : 0.0000  
 3rd Qu.: 0.6598  
 Max.   : 4.0611

You can see from the above statistical summary that every variable now has a mean of 0 and that the ranges of the numeric variables are comparable, so they can be used for modelling.
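
To make the formula concrete, here is a minimal sketch that standardises by hand and checks the result against scale(). The data frame `toy` and its values are hypothetical stand-ins for the predictor columns, not rows from the recipe's dataset.

```r
# Hypothetical toy data standing in for the predictor columns
toy <- data.frame(
  glucose = c(85, 117, 140, 199),
  age     = c(21, 29, 41, 81)
)

# Standardise by hand: subtract the column mean, divide by the column sd
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Compare with scale(), ignoring the extra attributes it attaches
print(all.equal(manual, scale(toy), check.attributes = FALSE))

# Each standardised column has mean 0 and standard deviation 1
print(round(colMeans(manual), 10))
print(apply(manual, 2, sd))
```

scale() also records the means and standard deviations it used in the "scaled:center" and "scaled:scale" attributes of the returned matrix, which can be read back with attr().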

2. Normalization

Normalisation, or min-max scaling, rescales the data to the range 0 to 1 by subtracting the variable's minimum from each value and then dividing by the range (maximum minus minimum).

Note: This preserves the shape of each variable’s distribution and makes it easier for us to compare them.

We will use the normalize() function from the "BBmisc" package in R with method = "range" to carry out normalisation.

# installing and loading the package
install.packages("BBmisc")
library(BBmisc)

# method = "range" for normalisation
scaled_df_norm = normalize(diabetes[,1:8], method = "range", range = c(0, 1))
summary(scaled_df_norm)

Pregnancies         Glucose       BloodPressure    SkinThickness   
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.05882   1st Qu.:0.4975   1st Qu.:0.5082   1st Qu.:0.0000  
 Median :0.17647   Median :0.5879   Median :0.5902   Median :0.2323  
 Mean   :0.22618   Mean   :0.6075   Mean   :0.5664   Mean   :0.2074  
 3rd Qu.:0.35294   3rd Qu.:0.7048   3rd Qu.:0.6557   3rd Qu.:0.3232  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Insulin             BMI         DiabetesPedigreeFunction      Age        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000          Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.4069   1st Qu.:0.07077          1st Qu.:0.0500  
 Median :0.03605   Median :0.4769   Median :0.12575          Median :0.1333  
 Mean   :0.09433   Mean   :0.4768   Mean   :0.16818          Mean   :0.2040  
 3rd Qu.:0.15041   3rd Qu.:0.5455   3rd Qu.:0.23409          3rd Qu.:0.3333  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000          Max.   :1.0000

You can see from the above statistical summary that every numeric variable now ranges from 0 to 1 and can be used for modelling.
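
The min-max formula can also be written in base R without an extra package. This is a minimal sketch: the helper `min_max` and the data frame `toy` are hypothetical names introduced for illustration, and the values are made up.

```r
# Hypothetical toy data standing in for the predictor columns
toy <- data.frame(
  glucose = c(85, 117, 140, 199),
  age     = c(21, 29, 41, 81)
)

# Min-max scaling: (x - min) / (max - min)
# Assumes the column has at least two distinct values (non-zero range)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the helper to every column
toy_norm <- as.data.frame(lapply(toy, min_max))

print(toy_norm)
print(sapply(toy_norm, range))
```

Each column's smallest value maps to 0 and its largest to 1, while the relative spacing of the values in between is unchanged.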
