How to create and optimize a baseline Decision Tree model for Regression in R?

This recipe helps you create and optimize a baseline Decision Tree model for Regression in R
Last Updated: 22 Jun 2021

Get access to Data Science projects View all Data Science projects

MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET ALL TAGS

Recipe Objective

Decision Tree is a supervised machine learning algorithm which can be used to perform both classification and regression on complex datasets. They are also known as Classification and Regression Trees (CART). Hence, it works for both continuous and categorical variables.

Important basic tree Terminology is as follows:

Root node: represents an entire popuplation or dataset which gets divided into two or more pure sets (also known as homogeneuos steps). It always contains a single input variable (x).
Leaf or terminal node: These nodes do not split further and contains the output variable

In this recipe, we will only focus on Regression Trees where the target variable is continuous in nature. The splits in these trees are based on minimising the Residual sum of squares of each groups formed. RSS is calculated by the predicted values is the mean response for the training observations within the jth group.

This recipe demonstrates the modelling and optimisation of a Regression Tree by using the sales data of bags.

STEP 1: Importing Necessary Libraries


library(caret)

library(tidyverse)		 # for data manipulation

STEP 2: Read a csv file and explore the data

The dataset attached contains the data of 160 different bags associated with ABC industries.

The bags have certain attributes which are described below:

Height – The height of the bag
Width – The width of the bag
Length – The length of the bag
Weight – The weight the bag can carry
Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags.


data <- read.csv("R_343_Data_1.csv")

glimpse(data)

Rows: 159
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...


summary(data)       # returns the statistical summary of the data columns

Cost            Weight         Weight1          Length     
 Min.   :   0.0   Min.   : 7.50   Min.   : 8.40   Min.   : 8.80  
 1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00   1st Qu.:23.15  
 Median : 273.0   Median :25.20   Median :27.30   Median :29.40  
 Mean   : 398.3   Mean   :26.25   Mean   :28.42   Mean   :31.23  
 3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50   3rd Qu.:39.65  
 Max.   :1650.0   Max.   :59.00   Max.   :63.40   Max.   :68.00  
     Height           Width      
 Min.   : 1.728   Min.   :1.048  
 1st Qu.: 5.945   1st Qu.:3.386  
 Median : 7.786   Median :4.248  
 Mean   : 8.971   Mean   :4.417  
 3rd Qu.:12.366   3rd Qu.:5.585  
 Max.   :18.957   Max.   :8.142


dim(data)

159 6

STEP 3: Train Test Split


# createDataPartition() function from the caret package to split the original dataset into a training and testing set and split data into training (80%) and testing set (20%)
parts = createDataPartition(data$Cost, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

STEP 4: Building and optimising Baseline Regression Tree

We will use caret package to perform Cross Validation and Hyperparameter tuning (max_depth) using grid search technique. First, we will use the trainControl() function to define the method of cross validation to be carried out and search type i.e. "grid" or "random". Then train the model using train() function with tuneGrid as one of the arguements.

Syntax: train(formula, data = , method = , trControl = , tuneGrid = )

where:

formula = y~x1+x2+x3+..., where y is the independent variable and x1,x2,x3 are the dependent variables
data = dataframe
method = Type of the model to be built ("rpart2" for CART)
trControl = Takes the control parameters. We will use trainControl function out here where we will specify the Cross validation technique.
tuneGrid = takes the tuning parameters and applies grid search CV on them


# specifying the CV technique which will be passed into the train() function later and number parameter is the "k" in K-fold cross validation
train_control = trainControl(method = "cv", number = 5, search = "grid")

## Customsing the tuning grid (ridge regression has alpha = 0)
Regress_Tree_Grid =  expand.grid(maxdepth = c(1,3,5,7,9))

set.seed(50)

# training a Regression model while tuning parameters (Method = "rpart2")
model = train(Cost~., data = train, method = "rpart2", trControl = train_control, tuneGrid = Regress_Tree_Grid)

# summarising the results
print(model)

CART 

129 samples
  5 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 103, 104, 103, 103, 103 
Resampling results across tuning parameters:

  maxdepth  RMSE      Rsquared   MAE      
  1         207.8910  0.6593256  157.83357
  3         142.9742  0.8496021  103.25668
  5         130.2922  0.8751730   87.63598
  7         127.8424  0.8794063   85.60232
  9         127.8424  0.8794063   85.60232

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 7.

Note: RMSE was used select the optimal model using the smallest value. And the final model has the max depth of 7.

STEP 5: Make predictions on the final Regression Tree model

We use our final Regression Tree model to make predictions on the testing data (unseen data) and predict the 'Cost' value and generate performance measures.


#use model to make predictions on test data
pred_y = predict(model, test)

# performance metrics on the test data
test_y = test[, 1]
mean((test_y - pred_y)^2) #mse - Mean Squared Error

caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error

18133.0077888765
134.658857075487

What Users are saying..

Jingwei Li

Graduate Research assistance at Stony Brook University

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data.... Read More