How to tune Hyper parameters using Grid Search in R?

This recipe helps you tune Hyper parameters using Grid Search in R

Recipe Objective

When we train a model, the best parameters are determined for each independent variable. For example, in linear reggression modelling, the coefficients of each independent variable is considered as a parameter i.e. they are found during the training process.

On the hand, Hyperparameters are are set by the user before training and are independent of the training process. For example, depth of a Decision Tree. These hyper parameters affects the performance as well as the parameters of the model. Hence, they need to be optimised. There are two ways to carry out Hyperparameter tuning: ​

  1. Grid Search: This technique generates evenly spaced values for each hyperparameters and then uses Cross validation to find the optimum values.
  2. Random Search: This technique generates random values for each hyperparameter being tested and then uses Cross validation to find the optimum values.

In this recipe, we will discuss how to build and optimise size of the tree in XGBoost using hyperparameter tuning using Grid Search. ​

Recently, researchers and enthusiasts have started using ensemble techniques like XGBoost to win data science competitions and hackathons. It outperforms algorithms such as Random Forest and Gadient Boosting in terms of speed as well as accuracy when performed on structured data. ​

XGBoost uses ensemble model which is based on Decision tree. A simple decision tree is considered to be a weak learner. The algorithm build sequential decision trees were each tree corrects the error occuring in the previous one until a condition is met. ​

STEP 1: Importing Necessary Libraries

install.packages('xgboost') # for fitting the xgboost model install.packages('caret') # for general data preparation and model fitting library(xgboost) library(caret) library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

The dataset attached contains the data of 160 different bags associated with ABC industries.

The bags have certain attributes which are described below: ​

  1. Height – The height of the bag
  2. Width – The width of the bag
  3. Length – The length of the bag
  4. Weight – The weight the bag can carry
  5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags. ​

data <- read.csv("R_338_Data_1.csv") glimpse(data)

Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...

summary(data) # returns the statistical summary of the data columns

Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142   

dim(data)

159 6

STEP 3: Train Test Split

# createDataPartition() function from the caret package to split the original dataset into a training and testing set and split data into training (80%) and testing set (20%) parts = createDataPartition(data$Cost, p = .8, list = F) train = data[parts, ] test = data[-parts, ]

STEP 4: Building and optimising xgboost model using Hyperparameter tuning

We will use caret package to perform Cross Validation and Hyperparameter tuning (nround- Number of trees and max_depth) using grid search technique. First, we will use the trainControl() function to define the method of cross validation to be carried out and search type i.e. "grid" or "random". Then train the model using train() function with tuneGrid as one of the arguements.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

  1. formula = y~x1+x2+x3+..., where y is the independent variable and x1,x2,x3 are the dependent variables
  2. data = dataframe
  3. method = Type of the model to be built
  4. trControl = Takes the control parameters. We will use trainControl function out here where we will specify the Cross validation technique.
  5. tuneGrid = takes the tuning parameters and applies grid search CV on them

# specifying the CV technique which will be passed into the train() function later and number parameter is the "k" in K-fold cross validation train_control = trainControl(method = "cv", number = 5, search = "grid") set.seed(50) # Customsing the tuning grid gbmGrid <- expand.grid(max_depth = c(3, 5, 7), nrounds = (1:10)*50, # number of trees # default values below eta = 0.3, gamma = 0, subsample = 1, min_child_weight = 1, colsample_bytree = 0.6) # training a XGboost Regression tree model while tuning parameters model = train(Cost~., data = train, method = "xgbTree", trControl = train_control, tuneGrid = gbmGrid) # summarising the results print(model)

129 samples
  5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 103, 104, 103, 103, 103 
Resampling results across tuning parameters:

  max_depth  nrounds  RMSE      Rsquared   MAE     
  3           50      61.87601  0.9681694  39.97181
  3          100      61.01752  0.9684654  39.46483
  3          150      60.79834  0.9686008  39.27449
  3          200      60.75953  0.9685933  39.19572
  3          250      60.77062  0.9685805  39.20809
  3          300      60.76614  0.9685810  39.20466
  3          350      60.76551  0.9685806  39.19794
  3          400      60.76258  0.9685815  39.19586
  3          450      60.76128  0.9685823  39.19451
  3          500      60.76116  0.9685822  39.19413
  5           50      60.46333  0.9692231  39.54743
  5          100      60.44432  0.9691914  39.52543
  5          150      60.44781  0.9691819  39.53476
  5          200      60.44717  0.9691816  39.53404
  5          250      60.44695  0.9691817  39.53376
  5          300      60.44695  0.9691817  39.53376
  5          350      60.44695  0.9691817  39.53376
  5          400      60.44695  0.9691817  39.53376
  5          450      60.44695  0.9691817  39.53376
  5          500      60.44695  0.9691817  39.53376
  7           50      62.56534  0.9667535  40.46075
  7          100      62.52348  0.9667409  40.45755
  7          150      62.52303  0.9667407  40.45686
  7          200      62.52303  0.9667407  40.45685
  7          250      62.52303  0.9667407  40.45685
  7          300      62.52303  0.9667407  40.45685
  7          350      62.52303  0.9667407  40.45685
  7          400      62.52303  0.9667407  40.45685
  7          450      62.52303  0.9667407  40.45685
  7          500      62.52303  0.9667407  40.45685

Tuning parameter 'eta' was held constant at a value of 0.3
Tuning

Tuning parameter 'min_child_weight' was held constant at a value of 1

Tuning parameter 'subsample' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 100, max_depth = 5, eta
 = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
 = 1.

Note: RMSE was used select the optimal model using the smallest value. And the final model consists of 100 trees and depth of 5.

STEP 5: Make predictions on the final xgboost model

We use our final xgboost model to make predictions on the testing data (unseen data) and predict the 'Cost' value and generate performance measures.

#use model to make predictions on test data pred_y = predict(model, test) # performance metrics on the test data test_y = test[, 1] mean((test_y - pred_y)^2) #mse - Mean Squared Error caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error

2078.97917959085
45.5958241464156

What Users are saying..

profile image

Ray han

Tech Leader | Stanford / Yale University
linkedin profile url

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell,Oracle, and Arthur Andersen(Accenture) in the US. I have taken Big Data and Hadoop,NoSQL, Spark, Hadoop... Read More

Relevant Projects

Azure Text Analytics for Medical Search Engine Deployment
Microsoft Azure Project - Use Azure text analytics cognitive service to deploy a machine learning model into Azure Databricks

Classification Projects on Machine Learning for Beginners - 2
Learn to implement various ensemble techniques to predict license status for a given business.

Learn Object Tracking (SOT, MOT) using OpenCV and Python
Get Started with Object Tracking using OpenCV and Python - Learn to implement Multiple Instance Learning Tracker (MIL) algorithm, Generic Object Tracking Using Regression Networks Tracker (GOTURN) algorithm, Kernelized Correlation Filters Tracker (KCF) algorithm, Tracking, Learning, Detection Tracker (TLD) algorithm for single and multiple object tracking from various video clips.

Multilabel Classification Project for Predicting Shipment Modes
Multilabel Classification Project to build a machine learning model that predicts the appropriate mode of transport for each shipment, using a transport dataset with 2000 unique products. The project explores and compares four different approaches to multilabel classification, including naive independent models, classifier chains, natively multilabel models, and multilabel to multiclass approaches.

Ecommerce product reviews - Pairwise ranking and sentiment analysis
This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

Learn to Build an End-to-End Machine Learning Pipeline - Part 1
In this Machine Learning Project, you will learn how to build an end-to-end machine learning pipeline for predicting truck delays, addressing a major challenge in the logistics industry.

MLOps AWS Project on Topic Modeling using Gunicorn Flask
In this project we will see the end-to-end machine learning development process to design, build and manage reproducible, testable, and evolvable machine learning models by using AWS

Build a Multi ClassText Classification Model using Naive Bayes
Implement the Naive Bayes Algorithm to build a multi class text classification model in Python.

Build Classification Algorithms for Digital Transformation[Banking]
Implement a machine learning approach using various classification techniques in Python to examine the digitalisation process of bank customers.

NLP Project to Build a Resume Parser in Python using Spacy
Use the popular Spacy NLP python library for OCR and text classification to build a Resume Parser in Python.