How to normalize and standardize data in R?

Recipe Objective

It is crucial to normalise or standardise the data before creating a machine learning model. This is because many machine learning algorithms tend to be dominated by the variables with larger scales, which affects the performance of the model. Normalisation and standardisation bring all the numeric variables onto a comparable scale so that no single variable dominates. This is a data preprocessing step that is applied only to the independent variables.

In this recipe, we will learn how to normalise and standardise the data in R.
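To see why scale matters, here is a minimal sketch using Euclidean distance, which is one common case where larger-scale variables dominate. The three patients and their values are made up for illustration and are not rows from the dataset used in this recipe.

```r
# Three hypothetical patients (made-up values, not taken from the dataset)
patients <- rbind(
  a = c(Age = 25, Insulin = 100),
  b = c(Age = 65, Insulin = 100),  # differs from a only in Age (40 years)
  c = c(Age = 25, Insulin = 200)   # differs from a only in Insulin (100 units)
)

# Euclidean distances on the raw values
d_raw <- as.matrix(dist(patients))

# The same distances after standardising each column with scale()
d_scaled <- as.matrix(dist(scale(patients)))

d_raw["a", c("b", "c")]     # a-b is 40, a-c is 100
d_scaled["a", c("b", "c")]  # both distances are now equal
```

On the raw values, patient c looks 2.5 times farther from a than b does, purely because Insulin is recorded in larger numbers than Age. After scaling, the two differences carry the same weight.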

STEP 1: Read the dataset

Data Description: This dataset consists of several medical predictor variables (also known as the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

# read the CSV file into a data frame named diabetes
diabetes = read.csv('R_242_diabetes.csv')

# print the statistical summary of the data
summary(diabetes)
Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

Note: The summary above shows that the ranges of the variables differ significantly, so we need to standardise or normalise them.

STEP 2: Standardization and Normalization

1. Standardization

This technique subtracts the mean from each value of a variable and divides the result by the variable's standard deviation. Each standardised variable then has mean = 0 and standard deviation = 1. If the variable comes from a normal distribution, the standardised values follow the standard normal distribution.
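The formula described above can be sketched by hand and compared against base R's scale(). The vector x below is a made-up toy example, not data from this recipe.

```r
# Hypothetical toy vector (made-up values)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # the two approaches agree
mean(z_manual)                # approximately 0, up to floating-point error
sd(z_manual)                  # approximately 1
```

Both sd() and scale() use the sample standard deviation (dividing by n - 1), which is why the results match.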

We will use the scale(data_frame) function to carry out this task. Note that scale() returns a matrix rather than a data frame.

# standardising the independent variables (columns 1 to 8)
scaled_df = scale(diabetes[,1:8])
summary(scaled_df)
Pregnancies         Glucose        BloodPressure     SkinThickness    
 Min.   :-1.1411   Min.   :-3.7812   Min.   :-3.5703   Min.   :-1.2874  
 1st Qu.:-0.8443   1st Qu.:-0.6848   1st Qu.:-0.3671   1st Qu.:-1.2874  
 Median :-0.2508   Median :-0.1218   Median : 0.1495   Median : 0.1544  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6395   3rd Qu.: 0.6054   3rd Qu.: 0.5629   3rd Qu.: 0.7186  
 Max.   : 3.9040   Max.   : 2.4429   Max.   : 2.7327   Max.   : 4.9187  
    Insulin             BMI            DiabetesPedigreeFunction
 Min.   :-0.6924   Min.   :-4.057829   Min.   :-1.1888         
 1st Qu.:-0.6924   1st Qu.:-0.595191   1st Qu.:-0.6885         
 Median :-0.4278   Median : 0.000941   Median :-0.2999         
 Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000         
 3rd Qu.: 0.4117   3rd Qu.: 0.584390   3rd Qu.: 0.4659         
 Max.   : 6.6485   Max.   : 4.452906   Max.   : 5.8797         
      Age         
 Min.   :-1.0409  
 1st Qu.:-0.7858  
 Median :-0.3606  
 Mean   : 0.0000  
 3rd Qu.: 0.6598  
 Max.   : 4.0611

You can see from the above statistical summary that the numeric variables are now on comparable scales (each has mean 0 and standard deviation 1) and can be used for modelling.

2. Normalization

Normalisation, or min-max scaling, rescales the data to the range 0 to 1 by subtracting the minimum from each value and then dividing by the range (maximum minus minimum).

Note: This preserves the shape of each variable’s distribution and makes it easier for us to compare them.
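The min-max formula can also be written by hand in base R. This is a minimal sketch: min_max is a hypothetical helper name and the toy data frame holds made-up values, so neither comes from the recipe or from any package.

```r
# Hypothetical helper: (x - min) / (max - min)
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(minimum, maximum)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data frame with made-up values on very different scales
toy <- data.frame(
  age     = c(21, 29, 41, 81),
  insulin = c(0, 30.5, 127.2, 846)
)

# Apply the helper column by column; lapply() returns a list,
# so as.data.frame() rebuilds the data frame
toy_norm <- as.data.frame(lapply(toy, min_max))

sapply(toy_norm, range)  # every column now spans 0 to 1
```

A constant column would make the denominator zero, so this sketch assumes every column has at least two distinct values.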

We will use the "BBmisc" package in R, whose normalize() function performs min-max normalisation when called with method = "range".

# install and load the package
install.packages("BBmisc")
library(BBmisc)

# method = "range" performs min-max normalisation
scaled_df_norm = normalize(diabetes[,1:8], method = "range", range = c(0, 1))
summary(scaled_df_norm)
Pregnancies         Glucose       BloodPressure    SkinThickness   
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.05882   1st Qu.:0.4975   1st Qu.:0.5082   1st Qu.:0.0000  
 Median :0.17647   Median :0.5879   Median :0.5902   Median :0.2323  
 Mean   :0.22618   Mean   :0.6075   Mean   :0.5664   Mean   :0.2074  
 3rd Qu.:0.35294   3rd Qu.:0.7048   3rd Qu.:0.6557   3rd Qu.:0.3232  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Insulin             BMI         DiabetesPedigreeFunction      Age        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000          Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.4069   1st Qu.:0.07077          1st Qu.:0.0500  
 Median :0.03605   Median :0.4769   Median :0.12575          Median :0.1333  
 Mean   :0.09433   Mean   :0.4768   Mean   :0.16818          Mean   :0.2040  
 3rd Qu.:0.15041   3rd Qu.:0.5455   3rd Qu.:0.23409          3rd Qu.:0.3333  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000          Max.   :1.0000 

You can see from the above statistical summary that every numeric variable now ranges from 0 to 1 and can be used for modelling.
