How to create and optimize a baseline Ridge Regression model in R?

This recipe helps you create and optimize a baseline Ridge Regression model in R

Recipe Objective

The subset selection methods use ordinary least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. This shrinkage is also known as regularisation.

It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. There are three main types of regularisation:

  1. Ridge regression: This uses L2 regularisation to penalise the coefficients of the predictor variables while they are being learned. It minimises the sum of squared residuals plus the penalty term added by L2 regularisation. This penalty shrinks the coefficients of variables with minor contributions close to (but not exactly) zero, which is useful when all the variables need to be kept in the model.
  2. Lasso regression: This type of regularisation shrinks the coefficients of variables with minor contributions exactly to zero by adding an L1 penalty term to the loss function. Only the most significant variables remain in the final model after applying this technique.
  3. Elastic-Net regression: It is a combination of both ridge and lasso regression.
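The three penalties can be seen side by side with the glmnet package (which caret calls under the hood for method = "glmnet"). A minimal sketch on synthetic data, assuming glmnet is installed; the data and coefficient values here are illustrative only:

```r
library(glmnet)

# synthetic data: 100 observations, 5 predictors (illustrative only)
set.seed(42)
x <- matrix(rnorm(100 * 5), nrow = 100)
y <- x %*% c(3, 1.5, 0, 0, 2) + rnorm(100)

# alpha selects the penalty: 0 = ridge (L2), 1 = lasso (L1), in between = elastic-net
ridge_fit <- glmnet(x, y, alpha = 0)
lasso_fit <- glmnet(x, y, alpha = 1)
enet_fit  <- glmnet(x, y, alpha = 0.5)

# at the same lambda, lasso sets small coefficients exactly to zero,
# while ridge only shrinks them towards zero
coef(ridge_fit, s = 0.5)
coef(lasso_fit, s = 0.5)
```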

In this recipe, we will discuss how to create and optimise a ridge regression model.

STEP 1: Importing Necessary Libraries

library(caret)     # for model training and cross validation
library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

The dataset attached contains data on 159 different bags associated with ABC industries.

The bags have certain attributes, which are described below:

  1. Height – The height of the bag
  2. Width – The width of the bag
  3. Length – The length of the bag
  4. Weight – The weight the bag can carry
  5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost it should set for a new variant of these kinds of bags.

data <- read.csv("R_340_Data_1.csv")
glimpse(data)
Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
summary(data) # returns the statistical summary of the data columns
Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142   
dim(data)
159 6

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training (80%) and testing (20%) set
parts = createDataPartition(data$Cost, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

STEP 4: Building and optimising Ridge Regression

We will use the caret package to perform cross validation and hyperparameter tuning (alpha and lambda values) using the grid search technique. First, we will use the trainControl() function to define the method of cross validation to be carried out and the search type, i.e. "grid" or "random". Then we train the model using the train() function with tuneGrid as one of the arguments.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

  1. formula = y~x1+x2+x3+..., where y is the dependent variable and x1, x2, x3 are the independent variables
  2. data = dataframe
  3. method = Type of the model to be built ("glmnet" for ridge, lasso or elastic-net regression)
  4. trControl = Takes the control parameters. We will use the trainControl() function here to specify the cross validation technique.
  5. tuneGrid = takes the tuning parameters and applies grid search CV on them
# specifying the CV technique to be passed into the train() function later;
# the number parameter is the "k" in k-fold cross validation
train_control = trainControl(method = "cv", number = 5, search = "grid")

# customising the tuning grid (ridge regression has alpha = 0)
ridgeGrid = expand.grid(alpha = 0, lambda = c(seq(0.1, 1.5, by = 0.1), seq(2, 5, 1), seq(5, 20, 2)))

set.seed(50)

# training a ridge regression model while tuning the parameters
model = train(Cost~., data = train, method = "glmnet", trControl = train_control, tuneGrid = ridgeGrid)

# summarising the results
print(model)
129 samples
  5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 103, 104, 103, 103, 103 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE    
   0.1    127.0819  0.8825179  98.1665
   0.2    127.0819  0.8825179  98.1665
   0.3    127.0819  0.8825179  98.1665
   0.4    127.0819  0.8825179  98.1665
   0.5    127.0819  0.8825179  98.1665
   0.6    127.0819  0.8825179  98.1665
   0.7    127.0819  0.8825179  98.1665
   0.8    127.0819  0.8825179  98.1665
   0.9    127.0819  0.8825179  98.1665
   1.0    127.0819  0.8825179  98.1665
   1.1    127.0819  0.8825179  98.1665
   1.2    127.0819  0.8825179  98.1665
   1.3    127.0819  0.8825179  98.1665
   1.4    127.0819  0.8825179  98.1665
   1.5    127.0819  0.8825179  98.1665
   2.0    127.0819  0.8825179  98.1665
   3.0    127.0819  0.8825179  98.1665
   4.0    127.0819  0.8825179  98.1665
   5.0    127.0819  0.8825179  98.1665
   7.0    127.0819  0.8825179  98.1665
   9.0    127.0819  0.8825179  98.1665
  11.0    127.0819  0.8825179  98.1665
  13.0    127.0819  0.8825179  98.1665
  15.0    127.0819  0.8825179  98.1665
  17.0    127.0819  0.8825179  98.1665
  19.0    127.0819  0.8825179  98.1665

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 19.

Note: RMSE was used to select the optimal model using the smallest value, and the final model has a lambda value of 19.
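caret keeps the full resampling profile on the returned object, so the chosen lambda can be inspected directly rather than read off the printed table. A self-contained sketch on synthetic data (not the bag dataset), assuming glmnet is installed; the same calls apply to the `model` object trained above:

```r
library(caret)

# small synthetic regression problem, for illustration only
set.seed(1)
df <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60))

ctrl <- trainControl(method = "cv", number = 5, search = "grid")
grid <- expand.grid(alpha = 0, lambda = seq(0.1, 2, by = 0.1))
fit  <- train(y ~ ., data = df, method = "glmnet",
              trControl = ctrl, tuneGrid = grid)

fit$bestTune       # the chosen alpha/lambda pair
head(fit$results)  # resampled RMSE/Rsquared/MAE per lambda
# plot(fit)        # RMSE profile across the lambda grid
```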

STEP 5: Make predictions on the final ridge regression model

We use our final ridge regression model to make predictions on the testing (unseen) data, predict the 'Cost' value, and generate performance measures.

# use model to make predictions on test data
pred_y = predict(model, test)

# performance metrics on the test data
test_y = test[, 1]
mean((test_y - pred_y)^2)    # MSE - Mean Squared Error
caret::RMSE(test_y, pred_y)  # RMSE - Root Mean Squared Error
13138.7222015375
114.624265326053
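caret also provides R2() and MAE() helpers alongside RMSE(), which could be applied to pred_y and test_y above in exactly the same way. A self-contained sketch on illustrative vectors (not the bag dataset):

```r
library(caret)

# illustrative observed values and predictions, for demonstration only
set.seed(2)
obs  <- rnorm(30, mean = 400, sd = 100)
pred <- obs + rnorm(30, sd = 20)

caret::RMSE(pred, obs)  # root mean squared error
caret::MAE(pred, obs)   # mean absolute error
caret::R2(pred, obs)    # proportion of variance explained
```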

Final Coefficients are mentioned below:

data.frame(ridge = as.data.frame.matrix(coef(model$finalModel, model$finalModel$lambdaOpt))) %>%
  rename(ridge = X1)
(Intercept)	-516.756313
Weight		8.391177
Weight1		7.564281
Length		5.726575
Height		9.096478
Width		49.89799

