How to build classification trees in R?

This recipe helps you build classification trees in R
Last Updated: 26 Dec 2022


Recipe Objective

A decision tree is a supervised machine learning algorithm that can perform both classification and regression on complex datasets. Decision trees are also known as Classification and Regression Trees (CART), so they work with both continuous and categorical target variables.

Important basic tree terminology:

Root node: represents the entire population or dataset, which gets divided into two or more purer subsets (also known as homogeneous sets). It always splits on a single input variable (x).
Leaf or terminal node: these nodes do not split further and contain the output variable.

In this recipe, we focus only on classification trees, where the target variable is categorical. The splits in these trees are based on the homogeneity of the groups formed. The homogeneity or impurity in the data is quantified by metrics such as Entropy, Information Gain and Gini Index.


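A commonly used metric is information gain. It quantifies how much information a feature variable provides about the class, measured as the reduction in entropy after splitting on that feature. Note that rpart(), used in this recipe, splits on the Gini index by default.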
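As a rough illustration of these metrics, the sketch below computes entropy, the Gini index and information gain in base R. The helper names (entropy, gini, info_gain) and the toy vectors are made up for this example; rpart computes its own split criteria internally.

```r
# Hypothetical helpers for illustration only; not part of rpart
entropy <- function(y) {
  p <- table(y) / length(y)   # class proportions
  p <- p[p > 0]               # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Information gain from splitting y by a logical condition 'left'
info_gain <- function(y, left) {
  w_left <- sum(left) / length(y)
  entropy(y) - (w_left * entropy(y[left]) + (1 - w_left) * entropy(y[!left]))
}

# Toy data: 4 positives and 4 negatives
y <- c(1, 1, 1, 1, 0, 0, 0, 0)
x <- c(9, 8, 7, 6, 5, 4, 3, 2)

entropy(y)             # 1 bit for a 50/50 node
gini(y)                # 0.5 for a 50/50 node
info_gain(y, x > 5.5)  # 1: this split separates the classes perfectly
```

A split with higher information gain (or a larger drop in Gini index) leaves purer child nodes, which is what the tree-growing procedure looks for at each step.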

This recipe demonstrates the modelling of a classification tree using a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

STEP 1: Importing Necessary Libraries

# For data manipulation
library(tidyverse)

# For the decision tree algorithm
library(rpart)

# For plotting the decision tree
install.packages("rpart.plot")
library(rpart.plot)

# For reading Excel sheets
install.packages("readxl")
library("readxl")

STEP 2: Loading the Train and Test Dataset

The train and test datasets are loaded separately. They were split in an 80/20 proportion.
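The recipe's files are already split. As a rough sketch of how a comparable 80/20 split could be produced, the example below uses sample() on a made-up data frame; full_data, the seed and the column values are all assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for an unsplit dataset
full_data <- data.frame(
  Glucose = rnorm(100, mean = 120, sd = 30),
  Outcome = rbinom(100, size = 1, prob = 0.35)
)

# Randomly pick 80% of the row indices for training
n_train   <- floor(0.8 * nrow(full_data))
train_idx <- sample(seq_len(nrow(full_data)), size = n_train)

train_demo <- full_data[train_idx, ]
test_demo  <- full_data[-train_idx, ]   # the remaining 20%

nrow(train_demo)  # 80
nrow(test_demo)   # 20
```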

Data Description: This dataset consists of several medical predictor variables (also known as independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

# Read the train and test sets with read_excel() from the readxl package
train = read_excel('R_256_df_train_regression.xlsx')
test = read_excel('R_256_df_test_regression.xlsx')

# Show the number of observations and variables, with a brief preview of each
glimpse(train)

Observations: 613
Variables: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

# Show the number of observations and variables, with a brief preview of each
glimpse(test)

Observations: 155
Variables: 9
$ Pregnancies               6, 11, 3, 6, 2, 9, 0, 2, 2, 6, 0, 2, 4, 0,...
$ Glucose                   105, 138, 106, 117, 68, 112, 119, 112, 92,...
$ BloodPressure             80, 74, 72, 96, 62, 82, 0, 86, 76, 94, 70,...
$ SkinThickness             28, 26, 0, 0, 13, 24, 0, 42, 20, 0, 27, 0,...
$ Insulin                   0, 144, 0, 0, 15, 0, 0, 160, 0, 0, 115, 0,...
$ BMI                       32.5, 36.1, 25.8, 28.7, 20.1, 28.2, 32.4, ...
$ DiabetesPedigreeFunction  0.878, 0.557, 0.207, 0.157, 0.257, 1.282, ...
$ Age                       26, 50, 27, 30, 23, 50, 24, 28, 28, 45, 21...
$ Outcome                   0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...

STEP 3: Data Preprocessing (Scaling)

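This is a pre-modelling step. Here the data is scaled, or standardised, so that different attributes are comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: A decision tree splits on the ordering of each predictor's values, so scaling does not change which splits are chosen. It is shown here as a general pre-modelling step; it matters mainly for distance-based or gradient-based models.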
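To make the effect of scale() concrete, here is a minimal sketch on a made-up data frame (toy is not part of the recipe's data). It checks that every column ends up with mean zero and standard deviation one.

```r
# Toy data frame, made up for this example
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))

# scale() returns a matrix, centred and scaled column by column
toy_scaled <- scale(toy)

colMeans(toy_scaled)      # approximately 0 for every column
apply(toy_scaled, 2, sd)  # 1 for every column
```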

# Scale columns 2 to 6 of the train dataset (Glucose through BMI)
train_scaled = scale(train[2:6])

# Use cbind() to add the Outcome column to the scaled values
train_scaled = data.frame(cbind(train_scaled, Outcome = train$Outcome))

train_scaled %>% head()

The first six rows are printed with the columns Glucose, BloodPressure, SkinThickness, Insulin, BMI and Outcome. The five predictors are now standardised, while Outcome keeps its original 0/1 values.

# Scale columns 2 to 6 of the test dataset (Glucose through BMI)
test_scaled = scale(test[2:6])

# Use cbind() to add the Outcome column to the scaled values
test_scaled = data.frame(cbind(test_scaled, Outcome = test$Outcome))

test_scaled %>% head()

The output has the same six columns as the scaled train set: Glucose, BloodPressure, SkinThickness, Insulin, BMI and Outcome.
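The recipe scales the test set with its own means and standard deviations. A common alternative, sketched here under made-up data, is to reuse the training set's statistics so that both sets share the same scale. The frames train_demo and test_demo are hypothetical; the "scaled:center" and "scaled:scale" attributes are what scale() attaches to its result.

```r
# Toy train/test frames, made up for this example
train_demo <- data.frame(Glucose = c(80, 100, 120, 140), BMI = c(20, 25, 30, 35))
test_demo  <- data.frame(Glucose = c(90, 130),           BMI = c(22, 33))

train_mat <- scale(train_demo)

# scale() stores the column means and standard deviations as attributes
train_center <- attr(train_mat, "scaled:center")
train_scale  <- attr(train_mat, "scaled:scale")

# Reuse the training statistics when transforming the test set
test_mat <- scale(test_demo, center = train_center, scale = train_scale)
test_mat
```

With this approach a given raw value maps to the same scaled value in both sets, so split thresholds learned on the training data keep their meaning on the test data.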

STEP 4: Creation of Decision Tree Classifier model using training set

We use rpart() function to fit the model.

Syntax: rpart(formula, data = , method = '')

Where:

formula: Outcome ~ . where Outcome is the dependent variable and . stands for all other columns of the data (here, the scaled predictors)
data = train_scaled
method = 'class' (to fit a classification model)

# Create the object 'model' using the rpart() function
model = rpart(Outcome ~ ., data = train_scaled, method = 'class')

We use the rpart.plot() function to plot the decision tree model:

rpart.plot(model)
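Since the recipe's Excel files may not be at hand, here is a self-contained sketch of the same rpart() call on kyphosis, a small dataset that ships with the rpart package. It is not the recipe's data. The sketch also shows the parms argument, which switches the split criterion from the default Gini index to information gain.

```r
library(rpart)

# kyphosis ships with rpart: Kyphosis (absent/present) ~ Age, Number, Start
fit_gini <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# Same model, but choosing splits by information gain instead of Gini
fit_info <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                  parms = list(split = "information"))

print(fit_gini)               # text view of the splits and node counts
fit_gini$variable.importance  # predictors ranked by their contribution
```

Printing the fitted object gives a text version of the tree that rpart.plot() draws, which can be handy when no graphics device is available.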

STEP 5: Predict using Test Dataset

We use the predict() function to do this.

Syntax: predict(fitted_model, df, type = '')

where:

fitted_model = model fitted on the train dataset
df = test dataset
type = 'class' for classification

predict_test = predict(model, test_scaled, type = "class")
predict_test %>% head()
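Besides class labels, predict() on an rpart classification tree can also return class probabilities. The sketch below illustrates both on the kyphosis dataset bundled with rpart, which stands in for the recipe's data here.

```r
library(rpart)

fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# type = "class" returns a factor of predicted labels
pred_class <- predict(fit, kyphosis, type = "class")

# type = "prob" returns a matrix with one probability column per class
pred_prob <- predict(fit, kyphosis, type = "prob")

head(pred_class)
head(pred_prob)
```

Probabilities are useful when a decision threshold other than the majority class is wanted, for example to trade sensitivity against specificity.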

STEP 6: Creation of confusion matrix

We use the table() function to create the confusion matrix of actual versus predicted values of the Outcome column. Rows are the actual classes and columns are the predicted classes.

confusion_matrix = table(test_scaled$Outcome, predict_test)
confusion_matrix

predict_test
     0  1
  0 84 16
  1 15 40
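The confusion matrix can be summarised with a few standard metrics. The sketch below rebuilds the matrix from the counts printed above (rows = actual, columns = predicted) and derives accuracy, sensitivity, specificity and precision; the variable names are chosen for this example.

```r
# Counts copied from the confusion matrix above; matrix() fills column by column
confusion_matrix <- matrix(c(84, 15, 16, 40), nrow = 2,
                           dimnames = list(actual = c("0", "1"),
                                           predicted = c("0", "1")))

accuracy    <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
sensitivity <- confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])  # recall for class 1
specificity <- confusion_matrix["0", "0"] / sum(confusion_matrix["0", ])
precision   <- confusion_matrix["1", "1"] / sum(confusion_matrix[, "1"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
# accuracy 0.800, sensitivity 0.727, specificity 0.840, precision 0.714
```

So the tree classifies 124 of the 155 test observations correctly, and it is better at recognising non-diabetic cases (84 of 100) than diabetic ones (40 of 55).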
