How to build classification trees in R?

This recipe helps you build classification trees in R

Recipe Objective

Decision Tree is a supervised machine learning algorithm which can be used to perform both classification and regression on complex datasets. They are also known as Classification and Regression Trees (CART). Hence, it works for both continuous and categorical variables.

Important basic tree Terminology is as follows: ​

  1. Root node: represents an entire popuplation or dataset which gets divided into two or more pure sets (also known as homogeneuos steps). It always contains a single input variable (x).
  2. Leaf or terminal node: These nodes do not split further and contains the output variable

In this recipe, we will only focus on Classification Trees where the target variable is categorical in nature. The splits in these trees are based on the homogeneity of the groups formed. The homogeinity or impurity in the data is quantified by computing metrics like Entropy, Information Gain and Gini Index. ​

List of Classification Algorithms in Machine Learning

Most commonly used Metric is Information gain. It is the measure to quantify how much information a feature variable provides about the class. ​

This recipe demonstrates the modelling of a Classification Tree, we use a famous dataset by National institute of Diabetes and Digestive and Kidney Diseases. ​

STEP 1: Importing Necessary Libraries

# For data manipulation library(tidyverse) # For Decision Tree algorithm library(rpart) # for plotting the decision Tree install.packages("rpart.plot") library(rpart.plot) # Install readxl R package for reading excel sheets install.packages("readxl") library("readxl")

STEP 2: Loading the Train and Test Dataset

Loading the test and train dataset sepearately. Here Train and test are split in 80/20 proportion respectively.

Data Description: This datasets consist of several medical predictor variables (also known as the independent variables) and one target variable (Outcome). ​

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variables: Outcome ( 0 = 'does not have diabetes', 1 = 'Has diabetes')

# calling the function read_excel from the readxl library train = read_excel('R_256_df_train_regression.xlsx') test = read_excel('R_256_df_test_regression.xlsx') # gives the number of observations and variables involved with its brief description glimpse(train)

Observations: 613
Variables: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

# gives the number of observations and variables involved with its brief description glimpse(test)

Observations: 155
Variables: 9
$ Pregnancies               6, 11, 3, 6, 2, 9, 0, 2, 2, 6, 0, 2, 4, 0,...
$ Glucose                   105, 138, 106, 117, 68, 112, 119, 112, 92,...
$ BloodPressure             80, 74, 72, 96, 62, 82, 0, 86, 76, 94, 70,...
$ SkinThickness             28, 26, 0, 0, 13, 24, 0, 42, 20, 0, 27, 0,...
$ Insulin                   0, 144, 0, 0, 15, 0, 0, 160, 0, 0, 115, 0,...
$ BMI                       32.5, 36.1, 25.8, 28.7, 20.1, 28.2, 32.4, ...
$ DiabetesPedigreeFunction  0.878, 0.557, 0.207, 0.157, 0.257, 1.282, ...
$ Age                       26, 50, 27, 30, 23, 50, 24, 28, 28, 45, 21...
$ Outcome                   0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. In this step, the data must be scaled or standardised so that different attributes can be comparable. Standardised data has mean zero and standard deviation one. we do thiis using scale() function.

Note: Scaling is an important pre-modelling step which has to be mandatory

# scaling the independent variables in train dataset train_scaled = scale(train[2:6]) # using cbind() function to add a new column Outcome to the scaled independent values train_scaled = data.frame(cbind(train_scaled, Outcome = train$Outcome)) train_scaled %>% head()

Weight		Weight1		Length		Height		Width		Outcome
-0.33379271	-0.3132781	-0.08858827	0.4095324	-0.42466337	242
-0.22300101	-0.1970948	0.04945726	0.6459374	-0.22972408	290
-0.23684997	-0.1712763	0.03795346	0.6207701	0.03681581	340
0.09552513	0.1514550	0.31404453	0.7075012	-0.12740825	363
0.12322305	0.1514550	0.37156350	0.6370722	0.33570907	430
0.16476994	0.2418198	0.45209006	0.9223343	0.19469206	450

# scaling the independent variables in train dataset test_scaled = scale(test[2:6]) # using cbind() function to add a new column Outcome to the scaled independent values test_scaled = data.frame(cbind(test_scaled, Outcome = test$Outcome)) test_scaled %>% head()

Weight		Weight1		Length		Height		Width		Outcome
0.72483012	0.72445274	0.69959684	2.15715925	1.87080937	1000
0.07204194	0.08459639	0.09077507	0.03471101	-0.06904068	200
0.17201851	0.17756697	0.24556027	0.07758442	0.29059599	300
0.23082825	0.23225555	0.29715533	0.14769072	0.39466263	300
0.35432872	0.35803927	0.34875040	0.25564092	0.22707121	300
0.39549554	0.39632128	0.38486694	0.56280832	0.48296300	430

STEP 4: Creation of Decision Tree Classifier model using training set

We use rpart() function to fit the model.

Syntax: rpart(formula, data = , method = '')

Where:

  1. Formula of the Decision Trees: Outcome ~. where Outcome is dependent variable and . represents all other independent variables
  2. data = train_scaled
  3. method = 'class' (to Fit a binary classification model)

# creation of an object 'model' using rpart function model = rpart(Outcome~., data = train_scaled, method = 'class')

Using rpart.plot() function to plot the decision tree model

rpart.plot(model)

STEP 5: Predict using Test Dataset

We use Predict() function to do the same.

Syntax: predict(fitted_model, df, type = '')

where:

  1. fitted_model = model fitted by train dataset
  2. df = test dataset
  3. type = 'class' fpr classification

predict_test = predict(model, test_scaled, type = "class") predict_test %>% head()

1 0
2 1
3 0
4 0
5 0
6 1  
​

STEP 6: Creation of confusion matrix

we use table() function to create the confusion matrix between actuals and predicted of Outcome Column

confusion_matrix = table(test_scaled$Outcome, predict_test) confusion_matrix

predict_test
     0  1
  0 84 16
  1 15 40
​

What Users are saying..

profile image

Ed Godalle

Director Data Analytics at EY / EY Tech
linkedin profile url

I am the Director of Data Analytics with over 10+ years of IT experience. I have a background in SQL, Python, and Big Data working with Accenture, IBM, and Infosys. I am looking to enhance my skills... Read More

Relevant Projects

Learn to Build Generative Models Using PyTorch Autoencoders
In this deep learning project, you will learn how to build a Generative Model using Autoencoders in PyTorch

Time Series Python Project using Greykite and Neural Prophet
In this time series project, you will forecast Walmart sales over time using the powerful, fast, and flexible time series forecasting library Greykite that helps automate time series problems.

Build a Graph Based Recommendation System in Python-Part 2
In this Graph Based Recommender System Project, you will build a recommender system project for eCommerce platforms and learn to use FAISS for efficient similarity search.

Detectron2 Object Detection and Segmentation Example Python
Object Detection using Detectron2 - Build a Dectectron2 model to detect the zones and inhibitions in antibiogram images.

Learn Object Tracking (SOT, MOT) using OpenCV and Python
Get Started with Object Tracking using OpenCV and Python - Learn to implement Multiple Instance Learning Tracker (MIL) algorithm, Generic Object Tracking Using Regression Networks Tracker (GOTURN) algorithm, Kernelized Correlation Filters Tracker (KCF) algorithm, Tracking, Learning, Detection Tracker (TLD) algorithm for single and multiple object tracking from various video clips.

Build a Text Classification Model with Attention Mechanism NLP
In this NLP Project, you will learn to build a multi class text classification model with attention mechanism.

MLOps using Azure Devops to Deploy a Classification Model
In this MLOps Azure project, you will learn how to deploy a classification machine learning model to predict the customer's license status on Azure through scalable CI/CD ML pipelines.

MLOps Project to Deploy Resume Parser Model on Paperspace
In this MLOps project, you will learn how to deploy a Resume Parser Streamlit Application on Paperspace Private Cloud.

AWS MLOps Project to Deploy a Classification Model [Banking]
In this AWS MLOps project, you will learn how to deploy a classification model using Flask on AWS.

Deep Learning Project for Text Detection in Images using Python
CV2 Text Detection Code for Images using Python -Build a CRNN deep learning model to predict the single-line text in a given image.