How to work with cross validation in R?

This recipe helps you work with cross validation in R
Last Updated: 22 Jun 2021


Recipe Objective

The major challenge when building a model is making it perform accurately on unseen data. Cross-validation is one technique for checking how well a model generalises. It reserves a portion of the data that is not used to train the model; that portion is later treated as unseen data to test/validate the model and estimate its prediction error.

The three most common Cross-Validation Techniques are:

Leave-one-out cross-validation (LOOCV)
K-fold cross-validation
Repeated k-fold cross-validation
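
As a rough sketch of how these three techniques map onto the caret package, the snippet below builds one trainControl() object for each. The values number = 10 and repeats = 3 are illustrative assumptions, not settings taken from this recipe.

```r
library(caret)

# Leave-one-out: each observation is held out once
ctrl_loocv <- trainControl(method = "LOOCV")

# K-fold: the data are split into `number` folds (10 is an assumed value)
ctrl_kfold <- trainControl(method = "cv", number = 10)

# Repeated k-fold: the k-fold split is redone `repeats` times (3 is an assumed value)
ctrl_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```

Any of these objects can be passed to train() through its trControl argument.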

In this recipe, we will learn how to perform Leave One Out Cross Validation on a linear regression model in R.

The Leave One Out Cross Validation technique splits the dataset into two parts, similar to the Validation Set approach, except that the validation part is a single data point. In this technique, the following steps take place:

One data point is reserved as the test point.
The model is trained on the remaining N-1 data points.
The model is tested against the one left-out data point.
The above three steps are repeated until every data point has been used once as the test point.
The final prediction error is the average of the errors over all N cases.
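
The steps above can be sketched by hand in base R. This is only an illustration: it uses the built-in mtcars dataset and the predictors wt and hp as stand-ins, which are assumptions and not part of this recipe's data.

```r
# Manual LOOCV for a linear model, using the built-in mtcars data as a stand-in
df <- mtcars[, c("mpg", "wt", "hp")]
n <- nrow(df)
sq_errors <- numeric(n)

for (i in seq_len(n)) {
  fit <- lm(mpg ~ ., data = df[-i, ])                    # train on the other n - 1 rows
  pred <- predict(fit, newdata = df[i, , drop = FALSE])  # predict the held-out row
  sq_errors[i] <- (df$mpg[i] - pred)^2                   # squared error for that row
}

loocv_mse <- mean(sq_errors)   # average error over all n held-out points
loocv_rmse <- sqrt(loocv_mse)
print(loocv_rmse)
```

Each pass through the loop corresponds to one repetition of the first three steps, and the final mean() is the averaging step.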

One of the major drawbacks of this technique is that it is computationally expensive: the model has to be fitted N times, once for each data point.
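
For ordinary least-squares models specifically, the repeated refitting can be avoided: the leave-one-out residual for a point equals its ordinary residual divided by one minus its leverage (hat value). The sketch below checks this against a brute-force loop on the built-in mtcars data, which is an assumption used only for illustration. The shortcut does not carry over to arbitrary model types.

```r
df <- mtcars[, c("mpg", "wt", "hp")]

# Fit once on all rows
fit <- lm(mpg ~ ., data = df)

# Leave-one-out residuals without refitting: e_i / (1 - h_i)
loo_resid <- residuals(fit) / (1 - hatvalues(fit))
shortcut_rmse <- sqrt(mean(loo_resid^2))

# Brute-force check: refit n times, holding out one row each time
brute_resid <- sapply(seq_len(nrow(df)), function(i) {
  fit_i <- lm(mpg ~ ., data = df[-i, ])
  df$mpg[i] - predict(fit_i, newdata = df[i, , drop = FALSE])
})
brute_rmse <- sqrt(mean(brute_resid^2))

print(c(shortcut = shortcut_rmse, brute_force = brute_rmse))
```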

STEP 1: Importing Necessary Libraries


library(caret)       # for cross-validation: trainControl() and train()

library(tidyverse)   # for data manipulation and glimpse()

STEP 2: Read a csv file and explore the data

The attached dataset contains data on 159 different bags associated with ABC industries.

The bags have certain attributes which are described below:

Height – The height of the bag
Width – The width of the bag
Length – The length of the bag
Weight – The weight the bag can carry
Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags.


data <- read.csv("R_302_Data_1.csv")

glimpse(data)

Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...


summary(data)       # returns the statistical summary of the data columns

Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142


dim(data)

159 6
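
The summary above shows a minimum Cost of 0, which may be worth checking before modelling. A minimal sketch of two such checks follows. It uses a small hypothetical data frame (toy, with made-up values) in place of the recipe's CSV, and whether zero-cost rows should be dropped is a judgment call that this recipe does not address.

```r
# Hypothetical stand-in for the recipe's data (values are made up)
toy <- data.frame(
  Cost   = c(242, 0, 340, NA),
  Weight = c(23.2, 19.0, 23.9, 26.3),
  Height = c(11.52, 6.11, 12.38, 12.73)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows where the response is exactly zero (which() skips the NA row)
zero_cost_rows <- toy[which(toy$Cost == 0), ]
print(zero_cost_rows)
```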

STEP 3: Performing LOOCV

We will use the caret package to perform cross-validation. First, we use the trainControl() function to define the cross-validation method, and then pass the result to the train() function.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

formula = y~x1+x2+x3+..., where y is the dependent (response) variable and x1, x2, x3 are the independent (predictor) variables
data = dataframe
method = Type of model to be built
trControl = Takes the control parameters. We will use the trainControl() function here to specify the cross-validation technique.
tuneGrid = Optional data frame of tuning parameter values to evaluate; it is not needed in this recipe.


# specifying the CV technique which will be passed into the train() function later
train_control = trainControl(method = "LOOCV")

# training a linear regression model with LOOCV
model = train(Cost~., data = data, method = "lm", trControl = train_control)

# summarising the results
print(model)

Linear Regression 

159 samples
  5 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 158, 158, 158, 158, 158, 158, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  128.9897  0.8693613  96.34783

Tuning parameter 'intercept' was held constant at a value of TRUE

Note: The RMSE, R-squared and MAE reported above are the cross-validation estimates of the model's prediction error, computed from the predictions on the held-out data points.
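
To make the three reported metrics concrete, here is a small base R sketch that computes them from a vector of observed values and a vector of held-out predictions. The function name cv_metrics and the example numbers are hypothetical. R-squared is taken here as the squared correlation between observed and predicted values, which is how caret's default summary defines it.

```r
cv_metrics <- function(obs, pred) {
  err <- obs - pred
  c(
    RMSE     = sqrt(mean(err^2)),   # root mean squared error
    Rsquared = cor(obs, pred)^2,    # squared correlation
    MAE      = mean(abs(err))       # mean absolute error
  )
}

obs  <- c(1, 2, 3, 4)          # made-up observed values
pred <- c(1.5, 2, 2.5, 4.5)    # made-up held-out predictions
metrics <- cv_metrics(obs, pred)
print(metrics)
```

RMSE and MAE are in the units of the response (Cost in this recipe), so lower values indicate better predictions, while R-squared lies between 0 and 1.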
