How to create and optimize a baseline Ridge Regression model in R?

This recipe helps you create and optimize a baseline Ridge Regression model in R

Recipe Objective

The subset selection methods use ordinary least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. This shrinkage is also known as regularisation.

It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. There are three main types of regularisation:

  1. Ridge regression: This uses L2 regularisation to penalise the coefficients of the predictor variables while they are being learned. It minimises the sum of squared residuals plus the penalty term added by L2 regularisation. This penalty shrinks the coefficients of variables with minor contributions close to (but not exactly) zero, which is useful when all the variables need to be kept in the model.
  2. Lasso regression: This type of regularisation shrinks the coefficients of variables with minor contributions exactly to zero by adding an L1 penalty term to the loss function. Only the most significant variables remain in the final model after applying this technique.
  3. Elastic-Net regression: It is a combination of both ridge and lasso regression.
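The three penalties can be seen side by side with the glmnet package (which caret calls under the hood for method = "glmnet"). A minimal sketch on synthetic data, assuming glmnet is installed; the data and coefficient values here are illustrative only:

```r
library(glmnet)

# synthetic data: 100 observations, 5 predictors (illustrative only)
set.seed(42)
x <- matrix(rnorm(100 * 5), nrow = 100)
y <- x %*% c(3, 1.5, 0, 0, 2) + rnorm(100)

# alpha selects the penalty: 0 = ridge (L2), 1 = lasso (L1), in between = elastic-net
ridge_fit <- glmnet(x, y, alpha = 0)
lasso_fit <- glmnet(x, y, alpha = 1)
enet_fit  <- glmnet(x, y, alpha = 0.5)

# at the same lambda, lasso sets small coefficients exactly to zero,
# while ridge only shrinks them towards zero
coef(ridge_fit, s = 0.5)
coef(lasso_fit, s = 0.5)
```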

In this recipe, we will discuss how to create and optimise a ridge regression model.

STEP 1: Importing Necessary Libraries

library(caret)     # for model training and cross validation
library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

The dataset attached contains data on 159 different bags associated with ABC industries.

The bags have certain attributes, which are described below:

  1. Height – The height of the bag
  2. Width – The width of the bag
  3. Length – The length of the bag
  4. Weight – The weight the bag can carry
  5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost it should set for a new variant of these kinds of bags.

data <- read.csv("R_340_Data_1.csv")
glimpse(data)
Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
summary(data) # returns the statistical summary of the data columns
Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142   
dim(data)
159 6

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training (80%) and testing (20%) set
parts = createDataPartition(data$Cost, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

STEP 4: Building and optimising Ridge Regression

We will use the caret package to perform cross validation and hyperparameter tuning (alpha and lambda values) using the grid search technique. First, we will use the trainControl() function to define the method of cross validation to be carried out and the search type, i.e. "grid" or "random". Then we train the model using the train() function with tuneGrid as one of the arguments.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

  1. formula = y~x1+x2+x3+..., where y is the dependent variable and x1, x2, x3 are the independent variables
  2. data = dataframe
  3. method = Type of the model to be built ("glmnet" for ridge, lasso or elastic-net regression)
  4. trControl = Takes the control parameters. We will use the trainControl() function here to specify the cross validation technique.
  5. tuneGrid = takes the tuning parameters and applies grid search CV on them
# specifying the CV technique to be passed into the train() function later;
# the number parameter is the "k" in k-fold cross validation
train_control = trainControl(method = "cv", number = 5, search = "grid")

# customising the tuning grid (ridge regression has alpha = 0)
ridgeGrid = expand.grid(alpha = 0, lambda = c(seq(0.1, 1.5, by = 0.1), seq(2, 5, 1), seq(5, 20, 2)))

set.seed(50)

# training a ridge regression model while tuning the parameters
model = train(Cost~., data = train, method = "glmnet", trControl = train_control, tuneGrid = ridgeGrid)

# summarising the results
print(model)
129 samples
  5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 103, 104, 103, 103, 103 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE    
   0.1    127.0819  0.8825179  98.1665
   0.2    127.0819  0.8825179  98.1665
   0.3    127.0819  0.8825179  98.1665
   0.4    127.0819  0.8825179  98.1665
   0.5    127.0819  0.8825179  98.1665
   0.6    127.0819  0.8825179  98.1665
   0.7    127.0819  0.8825179  98.1665
   0.8    127.0819  0.8825179  98.1665
   0.9    127.0819  0.8825179  98.1665
   1.0    127.0819  0.8825179  98.1665
   1.1    127.0819  0.8825179  98.1665
   1.2    127.0819  0.8825179  98.1665
   1.3    127.0819  0.8825179  98.1665
   1.4    127.0819  0.8825179  98.1665
   1.5    127.0819  0.8825179  98.1665
   2.0    127.0819  0.8825179  98.1665
   3.0    127.0819  0.8825179  98.1665
   4.0    127.0819  0.8825179  98.1665
   5.0    127.0819  0.8825179  98.1665
   7.0    127.0819  0.8825179  98.1665
   9.0    127.0819  0.8825179  98.1665
  11.0    127.0819  0.8825179  98.1665
  13.0    127.0819  0.8825179  98.1665
  15.0    127.0819  0.8825179  98.1665
  17.0    127.0819  0.8825179  98.1665
  19.0    127.0819  0.8825179  98.1665

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 19.

Note: RMSE was used to select the optimal model using the smallest value, and the final model has a lambda value of 19.
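caret keeps the full resampling profile on the returned object, so the chosen lambda can be inspected directly rather than read off the printed table. A self-contained sketch on synthetic data (not the bag dataset), assuming glmnet is installed; the same calls apply to the `model` object trained above:

```r
library(caret)

# small synthetic regression problem, for illustration only
set.seed(1)
df <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60))

ctrl <- trainControl(method = "cv", number = 5, search = "grid")
grid <- expand.grid(alpha = 0, lambda = seq(0.1, 2, by = 0.1))
fit  <- train(y ~ ., data = df, method = "glmnet",
              trControl = ctrl, tuneGrid = grid)

fit$bestTune       # the chosen alpha/lambda pair
head(fit$results)  # resampled RMSE/Rsquared/MAE per lambda
# plot(fit)        # RMSE profile across the lambda grid
```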

STEP 5: Make predictions on the final ridge regression model

We use our final ridge regression model to make predictions on the testing (unseen) data, predict the 'Cost' value, and generate performance measures.

# use model to make predictions on test data
pred_y = predict(model, test)

# performance metrics on the test data
test_y = test[, 1]
mean((test_y - pred_y)^2)    # MSE - Mean Squared Error
caret::RMSE(test_y, pred_y)  # RMSE - Root Mean Squared Error
13138.7222015375
114.624265326053
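caret also provides R2() and MAE() helpers alongside RMSE(), which could be applied to pred_y and test_y above in exactly the same way. A self-contained sketch on illustrative vectors (not the bag dataset):

```r
library(caret)

# illustrative observed values and predictions, for demonstration only
set.seed(2)
obs  <- rnorm(30, mean = 400, sd = 100)
pred <- obs + rnorm(30, sd = 20)

caret::RMSE(pred, obs)  # root mean squared error
caret::MAE(pred, obs)   # mean absolute error
caret::R2(pred, obs)    # proportion of variance explained
```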

Final Coefficients are mentioned below:

data.frame(ridge = as.data.frame.matrix(coef(model$finalModel, model$finalModel$lambdaOpt))) %>%
  rename(ridge = X1)
(Intercept)	-516.756313
Weight		8.391177
Weight1		7.564281
Length		5.726575
Height		9.096478
Width		49.89799

