How to build regression trees in R?

This recipe helps you build regression trees in R


Recipe Objective

A Decision Tree is a supervised machine learning algorithm that can perform both classification and regression on complex datasets. Decision trees are also known as Classification and Regression Trees (CART), and they work with both continuous and categorical variables.

Important basic tree terminology is as follows:

1. Root node: represents the entire population or dataset, which gets divided into two or more purer (more homogeneous) subsets. Each split is made on a single input variable (x).
2. Leaf or terminal node: these nodes do not split further and contain the output variable.

In this recipe, we focus on regression trees, where the target variable is continuous. Splits are chosen to minimise the residual sum of squares (RSS) of the groups they create: the predicted value for the jth group is the mean response of the training observations that fall into it, and the RSS sums the squared deviations of those observations from their group mean.
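As a toy illustration (made-up numbers, not the recipe's dataset), the RSS for a candidate split can be computed in R like this:

```r
# Toy illustration with made-up numbers (not the recipe's dataset).
# After a split, each group's prediction is the mean response of its
# training observations; RSS sums the squared deviations from those means.
y     <- c(2, 3, 4, 10, 11, 12)   # response values
group <- c(1, 1, 1, 2, 2, 2)      # group membership after a candidate split

rss <- sum(tapply(y, group, function(g) sum((g - mean(g))^2)))
rss  # 4: each group contributes (1 + 0 + 1) = 2
```

The tree-growing algorithm evaluates many candidate splits this way and keeps the one with the smallest RSS.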

This recipe demonstrates the modelling of a regression tree using a dataset of bag attributes, described in the next step.

STEP 1: Importing Necessary Libraries

```
# For data manipulation
library(tidyverse)

# For the Decision Tree algorithm
library(rpart)

# For plotting the decision tree
install.packages("rpart.plot")
library(rpart.plot)

# For reading Excel sheets
install.packages("readxl")
library(readxl)
```

STEP 2: Loading the Dataset

Loading the train and test datasets separately. Here train and test are split in an 80/20 proportion respectively.

Dataset description: the company wants to predict the cost it should set for a new variant of bag, based on the following attributes:

1. Height – The height of the bag
2. Width – The width of the bag
3. Length – The length of the bag
4. Weight – The weight the bag can carry
5. Weight1 – Weight the bag can carry after expansion
```
# Calling the function read_excel from the readxl library
train = read_excel('R_255_df_train_regression.xlsx')
test = read_excel('R_255_df_test_regression.xlsx')

# Gives the number of observations and variables, with a brief description
glimpse(train)
```
```Rows: 127
Columns: 6
$ Cost     242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
$ Weight   23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
$ Weight1  25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
$ Length   30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
$ Height   11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
$ Width    4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
```
```
# Gives the number of observations and variables, with a brief description
glimpse(test)
```
```Rows: 32
Columns: 6
$ Cost     1000.0, 200.0, 300.0, 300.0, 300.0, 430.0, 345.0, 456.0, 51...
$ Weight   41.1, 30.0, 31.7, 32.7, 34.8, 35.5, 36.0, 40.0, 40.0, 40.1,...
$ Weight1  44.0, 32.3, 34.0, 35.0, 37.3, 38.0, 38.5, 42.5, 42.5, 43.0,...
$ Length   46.6, 34.8, 37.8, 38.8, 39.8, 40.5, 41.0, 45.5, 45.5, 45.8,...
$ Height   12.4888, 5.5680, 5.7078, 5.9364, 6.2884, 7.2900, 6.3960, 7....
$ Width    7.5958, 3.3756, 4.1580, 4.3844, 4.0198, 4.5765, 3.9770, 4.3...
```

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. Here the data is scaled (standardised) so that different attributes become comparable: standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: tree-based models split on one variable at a time, so they are insensitive to monotone rescaling of the features and scaling is not strictly required for them; it is kept here as a general pre-modelling habit that does matter for many other algorithms.
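A quick sanity check of what scale() produces, on a made-up vector (an illustration, not the recipe's data):

```r
# Made-up numbers: scale() centres to mean 0 and divides by the sample sd
x <- c(10, 20, 30, 40, 50)
x_scaled <- scale(x)   # returns a matrix of (x - mean(x)) / sd(x)

mean(x_scaled)  # 0 (up to floating-point error)
sd(x_scaled)    # 1
```

Note that scale() returns a matrix with centering and scaling attributes attached, which is why the recipe wraps the result in data.frame() below.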

```
# Scaling the independent variables in the train dataset
train_scaled = scale(train[2:6])

# Using cbind() to add the Outcome column to the scaled independent variables
train_scaled = data.frame(cbind(train_scaled, Outcome = train$Cost))
train_scaled %>% head()
```
```Weight		Weight1		Length		Height		Width		Outcome
-0.33379271	-0.3132781	-0.08858827	0.4095324	-0.42466337	242
-0.22300101	-0.1970948	0.04945726	0.6459374	-0.22972408	290
-0.23684997	-0.1712763	0.03795346	0.6207701	0.03681581	340
0.09552513	0.1514550	0.31404453	0.7075012	-0.12740825	363
0.12322305	0.1514550	0.37156350	0.6370722	0.33570907	430
0.16476994	0.2418198	0.45209006	0.9223343	0.19469206	450
```
```
# Scaling the independent variables in the test dataset
test_scaled = scale(test[2:6])

# Using cbind() to add the Outcome column to the scaled independent variables
test_scaled = data.frame(cbind(test_scaled, Outcome = test$Cost))
test_scaled %>% head()
```
```Weight		Weight1		Length		Height		Width		Outcome
0.72483012	0.72445274	0.69959684	2.15715925	1.87080937	1000
0.07204194	0.08459639	0.09077507	0.03471101	-0.06904068	200
0.17201851	0.17756697	0.24556027	0.07758442	0.29059599	300
0.23082825	0.23225555	0.29715533	0.14769072	0.39466263	300
0.35432872	0.35803927	0.34875040	0.25564092	0.22707121	300
0.39549554	0.39632128	0.38486694	0.56280832	0.48296300	430
```

STEP 4: Creation of Decision Tree Regressor model using training set

We use the rpart() function to fit the model.

Syntax: rpart(formula, data = , method = '')

Where:

1. Formula of the decision tree: Outcome ~ . , where Outcome is the dependent variable and . represents all other (independent) variables
2. data = train_scaled
3. method = 'anova' (to Fit a regression model)
```
# Creation of an object 'model' using the rpart function
model = rpart(Outcome ~ ., data = train_scaled, method = 'anova')
```

We use the rpart.plot() function to plot the fitted decision tree model.

``` rpart.plot(model) ```
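rpart controls tree growth with a complexity parameter (cp): printcp() shows the cross-validated error of each nested subtree, and prune() cuts the tree back at a chosen cp. A minimal sketch on R's built-in mtcars data, used here only as a stand-in since the recipe's Excel files are not bundled:

```r
library(rpart)

# Fit a small anova (regression) tree on built-in data as a stand-in
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Complexity-parameter table: one row per nested subtree,
# with cross-validated error estimates
printcp(fit)

# Prune back to the subtree implied by a chosen cp threshold
pruned <- prune(fit, cp = 0.1)
```

Inspecting the cp table before deploying the tree helps avoid overfitting, which single unpruned trees are prone to.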

STEP 5: Predict using Test Dataset

We use the predict() function to do so.

Syntax: predict(fitted_model, df, type = '')

where:

1. fitted_model = model fitted by train dataset
2. df = test dataset
```
predict_test = predict(model, test_scaled)
predict_test %>% head()
```
```
1 700.909090909091
2 316.625
3 316.625
4 316.625
5 495.9
6 495.9
```
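To gauge how close these predictions are to the observed costs, a common summary is the root mean squared error. The helper below is a hypothetical addition, not part of the original recipe; with the objects above it would be called as rmse(test_scaled$Outcome, predict_test). A self-contained check with made-up numbers:

```r
# Hypothetical helper (not in the original recipe): root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# With the recipe's objects this would be:
#   rmse(test_scaled$Outcome, predict_test)
# Self-contained check on made-up numbers:
rmse(c(100, 200, 300), c(110, 190, 310))  # 10
```

Lower RMSE means predictions sit closer to the observed values, in the same units as the response.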
