How to optimise the number of trees in XGBoost in R?

This recipe shows how to optimise the number of trees in XGBoost in R

Recipe Objective

Classification and regression are supervised learning tasks that can be solved with algorithms such as linear regression, logistic regression, and decision trees. On their own, however, these models are often not competitive in terms of prediction accuracy. Ensemble techniques, on the other hand, create multiple models and combine them into one to produce better results.

Bagging, boosting, and random forests are different types of ensemble techniques. Boosting is a sequential ensemble technique in which each new model is improved using information from the previously grown weak models. This process continues for multiple iterations until a final model is built that predicts a more accurate outcome.

There are 3 commonly used boosting techniques:

  1. AdaBoost
  2. Gradient Boosting
  3. XGBoost

Recently, researchers and enthusiasts have been using ensemble techniques like XGBoost to win data science competitions and hackathons. It often outperforms algorithms such as Random Forest and Gradient Boosting in both speed and accuracy on structured data.

XGBoost is an ensemble method based on decision trees. A single, shallow decision tree is considered a weak learner. The algorithm builds decision trees sequentially, where each tree corrects the errors made by the previous ones, until a stopping condition is met.
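
As a rough illustration of this idea only (this is not XGBoost itself, just a hand-rolled sketch of fitting shallow trees to residuals with rpart; the function name boost_sketch and the toy data are purely illustrative):

library(rpart)

# toy sequential boosting: each shallow tree is fit to the residuals of the
# current ensemble prediction, and the prediction is nudged towards y
boost_sketch <- function(x, y, n_trees = 50, learning_rate = 0.1) {
  pred <- rep(mean(y), length(y))                  # start from the mean of y
  for (i in seq_len(n_trees)) {
    residuals <- y - pred                          # errors of the current ensemble
    tree <- rpart(residuals ~ x, control = rpart.control(maxdepth = 2))
    pred <- pred + learning_rate * predict(tree)   # each new tree corrects those errors
  }
  pred
}

# example on a simple noisy curve
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
fitted <- boost_sketch(x, y)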

In this recipe, we will discuss how to build an XGBoost model and optimise the number of trees (nrounds).

STEP 1: Importing Necessary Libraries

library(caret)       # for general data preparation and model fitting
library(rpart.plot)
library(tidyverse)
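
The "xgbTree" method used later calls the xgboost package under the hood, so it must be installed as well; a minimal, optional check (assuming a standard CRAN installation):

# install the xgboost backend if it is not already available
if (!requireNamespace("xgboost", quietly = TRUE)) {
  install.packages("xgboost")
}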

STEP 2: Read a csv file and explore the data

The dataset contains measurements for 159 different bags associated with ABC Industries.

The bags have the following attributes:

  1. Height – The height of the bag
  2. Width – The width of the bag
  3. Length – The length of the bag
  4. Weight – The weight the bag can carry
  5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags. ​

data <- read.csv("R_359_Data_1.csv")
glimpse(data)
Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
summary(data) # returns the statistical summary of the data columns
Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142   
dim(data)
159 6
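
Before splitting the data, it can also help to confirm that no values are missing; a quick optional check in base R (nothing beyond the data object loaded above is assumed):

# count missing values per column; all zeros means no imputation is needed
colSums(is.na(data))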

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset into training (80%) and testing (20%) sets
parts = createDataPartition(data$Cost, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

STEP 4: Building and Optimising the number of trees in XGBoost

We will use the caret package to perform cross-validation and hyperparameter tuning (nrounds, the number of trees, and max_depth) using a grid search. First, we use the trainControl() function to define the cross-validation method, and then pass it to the train() function.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

  1. formula = y~x1+x2+x3+..., where y is the dependent variable and x1,x2,x3 are the independent variables
  2. data = dataframe
  3. method = Type of the model to be built
  4. trControl = the control parameters. We use the trainControl() function here to specify the cross-validation technique.
  5. tuneGrid = the grid of tuning parameter values over which grid-search CV is applied
set.seed(50)

# specifying the CV technique which will be passed into the train() function later;
# the "number" parameter is the "k" in k-fold cross-validation
train_control = trainControl(method = "cv", number = 5)

# customising the tuning grid
gbmGrid <- expand.grid(max_depth = c(3, 5, 7),
                       nrounds = (1:10)*50,    # number of trees
                       # default values below
                       eta = 0.3,
                       gamma = 0,
                       subsample = 1,
                       min_child_weight = 1,
                       colsample_bytree = 0.6)

# training an XGBoost regression tree model while tuning parameters
model = train(Cost ~ ., data = train, method = "xgbTree", trControl = train_control, tuneGrid = gbmGrid)

# summarising the results
print(model)
eXtreme Gradient Boosting 

129 samples
  5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 103, 104, 103, 103, 103 
Resampling results across tuning parameters:

  max_depth  nrounds  RMSE      Rsquared   MAE     
  3           50      66.59079  0.9680913  44.08832
  3          100      66.42035  0.9678767  43.76808
  3          150      66.48990  0.9677309  43.93413
  3          200      66.43748  0.9677675  43.94378
  3          250      66.43912  0.9677689  43.96677
  3          300      66.44803  0.9677594  43.97885
  3          350      66.45109  0.9677553  43.98506
  3          400      66.45156  0.9677535  43.98565
  3          450      66.45176  0.9677532  43.98631
  3          500      66.45210  0.9677528  43.98681
  5           50      58.84186  0.9729939  39.82767
  5          100      58.87888  0.9729470  39.91885
  5          150      58.88364  0.9729429  39.93110
  5          200      58.88457  0.9729421  39.93284
  5          250      58.88449  0.9729422  39.93282
  5          300      58.88449  0.9729422  39.93282
  5          350      58.88450  0.9729422  39.93283
  5          400      58.88450  0.9729422  39.93281
  5          450      58.88450  0.9729422  39.93281
  5          500      58.88450  0.9729422  39.93281
  7           50      61.11796  0.9701455  41.64511
  7          100      61.12728  0.9701290  41.68571
  7          150      61.12699  0.9701292  41.68622
  7          200      61.12699  0.9701293  41.68622
  7          250      61.12699  0.9701293  41.68622
  7          300      61.12699  0.9701292  41.68622
  7          350      61.12699  0.9701292  41.68622
  7          400      61.12699  0.9701292  41.68622
  7          450      61.12699  0.9701292  41.68622
  7          500      61.12699  0.9701292  41.68622

Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 5, eta
 = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
 = 1.
Note: RMSE was used to select the optimal model using the smallest value, and the final model consists of 50 trees (nrounds = 50) with max_depth = 5.
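
To inspect the tuning results programmatically, caret stores them in the fitted object; a short optional sketch using standard accessors on a caret train object (not part of the original recipe):

# best combination of tuning parameters found by the grid search
model$bestTune

# resampling results for every nrounds / max_depth combination
head(model$results)

# visualise RMSE against the number of trees for each max_depth
plot(model)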

STEP 5: Make predictions with the final XGBoost model

We use the final XGBoost model to predict the 'Cost' value for the testing data (unseen data) and generate performance measures.

# use the model to make predictions on test data
pred_y = predict(model, test)

# performance metrics on the test data
test_y = test[, 1]
mean((test_y - pred_y)^2)     # MSE  - Mean Squared Error
caret::RMSE(test_y, pred_y)   # RMSE - Root Mean Squared Error
1255.73244466726
35.4363153370558  
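
Other error metrics can be reported in the same way; a small optional sketch using caret's exported helpers, applied to the same pred_y and test_y computed above:

caret::MAE(pred_y, test_y)   # Mean Absolute Error on the test set
caret::R2(pred_y, test_y)    # R-squared between predictions and actual Cost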
