How to implement Ridge regression in R

In this recipe, we shall learn how to implement ridge regression in R. It is a model tuning technique that can be used to analyze data that suffers from multicollinearity.

Recipe Objective: How to implement Ridge regression in R?

Ridge regression is a model tuning technique that can be used to analyze data that suffers from multicollinearity. It uses the L2 regularization technique. When multicollinearity is present, least-squares estimates are unbiased but their variances are large, so the predicted values can be far from the actual values. The coefficients in a ridge regression model are estimated using the ridge estimator; the resulting model is biased but has lower variance than an OLS estimator. The steps to implement ridge regression in R are as follows:
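For reference, the objective that ridge regression minimizes can be written as follows (a standard formulation; the notation here is ours, not from the recipe):

\hat{\beta}_{ridge} = \arg\min_{\beta} \left\{ \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \right\}

where \lambda \ge 0 controls the strength of the L2 penalty: \lambda = 0 recovers ordinary least squares, while larger values shrink the coefficients toward zero.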


Step 1: Load the required packages

#importing required packages
library(caret)
library(glmnet)
library(MASS)

Step 2: Load the dataset

Boston is an inbuilt dataset in R which contains housing data for 506 census tracts of Boston from the 1970 census.
crim- per capita crime rate by town
zn- the proportion of residential land zoned for lots over 25,000 sq. ft.
indus- the proportion of non-retail business acres per town
chas- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox- nitric oxides concentration (parts per 10 million)
rm- the average number of rooms per dwelling
age- the proportion of owner-occupied units built before 1940
dis- weighted distances to five Boston employment centers
rad- index of accessibility to radial highways
tax- full-value property-tax rate per USD 10,000
ptratio- pupil-teacher ratio by town
black- 1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat- the percentage of the lower status of the population
medv- median value of owner-occupied homes in USD 1000's

#loading the dataset
data <- Boston
head(data)

     crim zn indus chas   nox    rm  age    dis rad tax
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222
  ptratio  black lstat medv
1    15.3 396.90  4.98 24.0
2    17.8 396.90  9.14 21.6
3    17.8 392.83  4.03 34.7
4    18.7 394.63  2.94 33.4
5    18.7 396.90  5.33 36.2
6    18.7 394.12  5.21 28.7

Step 3: Check the structure of the dataset

#structure of the dataset
str(data)

'data.frame':	506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

All the columns are of int or numeric type.
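Since ridge regression is motivated by multicollinearity, it can be worth confirming that the Boston predictors are in fact correlated before fitting. A minimal check (not part of the original recipe) using base R's cor():

#optional: pairwise correlations among the predictors;
#large absolute values (e.g. between rad and tax) indicate multicollinearity
round(cor(subset(data, select = -medv)), 2)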

Step 4: Train-Test split

#train-test split
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test <- data[ind == 2, ]

Step 5: Create custom Control Parameters

#creating custom control parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = TRUE)

Step 6: Model Fitting

#fitting the ridge regression model (alpha = 0 gives ridge in glmnet)
set.seed(1234)
ridge <- train(medv ~ ., train,
               method = "glmnet",
               tuneGrid = expand.grid(alpha = 0,
                                      lambda = seq(0.0001, 1, length = 5)),
               trControl = custom)
ridge

Output:
glmnet 

353 samples
13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ... 
Resampling results across tuning parameters:

  lambda    RMSE      Rsquared   MAE     
  0.000100  4.242204  0.7782278  3.008339
  0.250075  4.242204  0.7782278  3.008339
  0.500050  4.242204  0.7782278  3.008339
  0.750025  4.248536  0.7779462  3.012397
  1.000000  4.265479  0.7770264  3.023091

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the
 smallest value.
The final values used for the model were alpha = 0 and
 lambda = 0.50005.
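The selected tuning parameters and the corresponding ridge coefficients can be extracted from the fitted object; a short sketch using the standard caret and glmnet accessors:

#best alpha and lambda chosen by cross-validation
ridge$bestTune
#ridge coefficients at the selected lambda
coef(ridge$finalModel, s = ridge$bestTune$lambda)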

Step 7: Check RMSE value

#mean cross-validated RMSE across all resamples
mean(ridge$resample$RMSE)

[1] 4.242204
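The test set created in Step 4 has not been used yet; a minimal sketch (not part of the original recipe) for estimating out-of-sample error with caret's predict() and RMSE() helpers:

#evaluating the tuned model on the held-out test set
pred <- predict(ridge, newdata = test)
RMSE(pred, test$medv)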

Step 8: Plots

#plotting the model
plot(ridge, main = "Ridge Regression")
#plotting important variables
plot(varImp(ridge, scale = TRUE))

nox, rm, and chas were the top three most important variables.
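To see the shrinkage effect directly, the coefficient paths of the underlying glmnet fit can also be plotted (an optional addition, not part of the original recipe):

#coefficient paths: each curve shows one predictor's coefficient
#shrinking toward zero as log(lambda) increases
plot(ridge$finalModel, xvar = "lambda", label = TRUE)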

