How to normalize and standardize data in R?

This recipe helps you normalize and standardize data in R

Recipe Objective

It is crucial to normalise or standardise the data before creating a machine learning model. Many machine learning algorithms tend to be dominated by the variables with larger scales, which hurts the performance of the model. Normalisation and standardisation bring all the numeric variables to a comparable range so that no variable dominates simply because of its units. Both are data preprocessing steps that are applied only to the independent variables.
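To make the scale problem concrete, here is a minimal sketch in base R. The two patients and their values are made up for illustration (they are not rows of the dataset used later); only the variable names are borrowed from it. The sketch computes the Euclidean distance between the two rows and shows how much of it comes from each variable.

```r
# Two hypothetical patients; the values are invented for illustration only.
patients <- data.frame(
  Insulin = c(30, 130),
  DiabetesPedigreeFunction = c(0.2, 2.0)
)

# Euclidean distance between the two rows on the raw, unscaled values
raw_distance <- as.numeric(dist(patients))

# Share of the squared distance contributed by each variable
squared_diffs <- sapply(patients, function(x) diff(x)^2)
share <- squared_diffs / sum(squared_diffs)

print(raw_distance)
print(round(share, 4))
```

In this toy case almost all of the distance comes from Insulin, even though the second variable changes tenfold between the two rows. A distance-based model would therefore largely ignore the smaller-scale variable unless the data is rescaled first.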

In this recipe, we will learn how to normalise and standardise the data in R.

STEP 1: Read the dataset

Data Description: This dataset consists of several medical predictor variables (also known as the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

# reading the dataset into a data frame named diabetes
diabetes = read.csv('R_242_diabetes.csv')

# printing the statistical summary of the data
summary(diabetes)

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

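Note: You can clearly see from the above summary that the ranges of the variables differ significantly (for example, Insulin ranges from 0 to 846 while DiabetesPedigreeFunction ranges from 0.078 to 2.42), so we need to standardise or normalise the data in this case.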

STEP 2: Standardization and Normalization

1. Standardization

This technique subtracts the mean of the variable from each individual value and divides the result by the standard deviation of the variable. After standardising, every variable has mean = 0 and standard deviation = 1. If the variable comes from a normal distribution, the standardised values follow the standard normal distribution.
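As a minimal sketch of the formula just described, the snippet below standardises a small made-up vector by hand and checks the result against base R's scale(). The vector x and the names z_manual and z_scale are illustrative assumptions, not part of the recipe's dataset.

```r
# Made-up values for illustration only
x <- c(2, 4, 6, 8, 10)

# Standardise by hand: subtract the mean, then divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# Same computation using base R's scale(); it returns a one-column matrix,
# so convert it back to a plain numeric vector for comparison
z_scale <- as.numeric(scale(x))

# The two results should agree, with mean 0 and standard deviation 1
print(all.equal(z_manual, z_scale))
print(mean(z_manual))
print(sd(z_manual))
```

Note that sd() and scale() both use the sample standard deviation (dividing by n - 1), which is why the two results match.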

We will use the scale(data_frame) function to carry out this task.

# standardising the independent variables
scaled_df = scale(diabetes[,1:8])
summary(scaled_df)

Pregnancies         Glucose        BloodPressure     SkinThickness    
 Min.   :-1.1411   Min.   :-3.7812   Min.   :-3.5703   Min.   :-1.2874  
 1st Qu.:-0.8443   1st Qu.:-0.6848   1st Qu.:-0.3671   1st Qu.:-1.2874  
 Median :-0.2508   Median :-0.1218   Median : 0.1495   Median : 0.1544  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6395   3rd Qu.: 0.6054   3rd Qu.: 0.5629   3rd Qu.: 0.7186  
 Max.   : 3.9040   Max.   : 2.4429   Max.   : 2.7327   Max.   : 4.9187  
    Insulin             BMI            DiabetesPedigreeFunction
 Min.   :-0.6924   Min.   :-4.057829   Min.   :-1.1888         
 1st Qu.:-0.6924   1st Qu.:-0.595191   1st Qu.:-0.6885         
 Median :-0.4278   Median : 0.000941   Median :-0.2999         
 Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000         
 3rd Qu.: 0.4117   3rd Qu.: 0.584390   3rd Qu.: 0.4659         
 Max.   : 6.6485   Max.   : 4.452906   Max.   : 5.8797         
      Age         
 Min.   :-1.0409  
 1st Qu.:-0.7858  
 Median :-0.3606  
 Mean   : 0.0000  
 3rd Qu.: 0.6598  
 Max.   : 4.0611

You can see from the above statistical summary that every variable now has a mean of 0 and that the numeric variables are on comparable scales, so they can be used for modelling.
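One practical detail when using the result for modelling: scale() returns a numeric matrix rather than a data frame, and it stores the centring and scaling values it used as attributes. The sketch below illustrates this on a made-up toy data frame; the names toy, scaled_toy and scaled_toy_df are assumptions for illustration.

```r
# Made-up toy data, not taken from the diabetes dataset
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

scaled_toy <- scale(toy)

# scale() returns a numeric matrix, not a data frame
print(is.matrix(scaled_toy))

# The column means and standard deviations used are kept as attributes
centers <- attr(scaled_toy, "scaled:center")
scales  <- attr(scaled_toy, "scaled:scale")
print(centers)
print(scales)

# Convert back to a data frame if a later step expects one
scaled_toy_df <- as.data.frame(scaled_toy)
```

Keeping the stored centres and scales around can be useful if the same transformation has to be applied to other data later.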

2. Normalization

Normalisation, or min-max scaling, brings the data into the range 0 to 1 by subtracting the minimum of the variable from each value and then dividing by the range (maximum minus minimum).

Note: This preserves the shape of each variable’s distribution and makes it easier for us to compare them.
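The min-max formula can be sketched in base R as follows. The helper name min_max is hypothetical, and the input vector reuses the five Age values printed in the summary in STEP 1 (minimum, first quartile, median, third quartile, maximum) purely as a worked example.

```r
# Hypothetical helper: min-max scaling of a numeric vector to [0, 1].
# A constant vector has a range of zero and would produce NaN here.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Age values taken from the summary above: Min, 1st Qu., Median, 3rd Qu., Max
age_values <- c(21, 24, 29, 41, 81)

age_scaled <- min_max(age_values)
print(round(age_scaled, 4))
# → 0.0000 0.0500 0.1333 0.3333 1.0000
```

For instance, the median age of 29 becomes (29 - 21) / (81 - 21) = 0.1333.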

We will be using the normalize() function from the "BBmisc" package in R, with method = "range", to carry out normalisation.

# installing and loading the package
install.packages("BBmisc")
library(BBmisc)

# method = "range" for normalisation
scaled_df_norm = normalize(diabetes[,1:8], method = "range", range = c(0, 1))
summary(scaled_df_norm)

Pregnancies         Glucose       BloodPressure    SkinThickness   
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.05882   1st Qu.:0.4975   1st Qu.:0.5082   1st Qu.:0.0000  
 Median :0.17647   Median :0.5879   Median :0.5902   Median :0.2323  
 Mean   :0.22618   Mean   :0.6075   Mean   :0.5664   Mean   :0.2074  
 3rd Qu.:0.35294   3rd Qu.:0.7048   3rd Qu.:0.6557   3rd Qu.:0.3232  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Insulin             BMI         DiabetesPedigreeFunction      Age        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000          Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.4069   1st Qu.:0.07077          1st Qu.:0.0500  
 Median :0.03605   Median :0.4769   Median :0.12575          Median :0.1333  
 Mean   :0.09433   Mean   :0.4768   Mean   :0.16818          Mean   :0.2040  
 3rd Qu.:0.15041   3rd Qu.:0.5455   3rd Qu.:0.23409          3rd Qu.:0.3333  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000          Max.   :1.0000 

You can see from the above statistical summary that every numeric variable now lies between 0 and 1 and can be used for modelling.
