How to build classification trees in R?

This recipe helps you build classification trees in R

Recipe Objective

A decision tree is a supervised machine learning algorithm that can be used to perform both classification and regression on complex datasets. Decision trees are also known as Classification and Regression Trees (CART), so they work with both continuous and categorical variables.

Important basic tree terminology is as follows:

  1. Root node: represents the entire population or dataset, which gets divided into two or more purer (more homogeneous) subsets. Its split is always based on a single input variable (x).
  2. Leaf or terminal node: these nodes do not split further and contain the output variable.

In this recipe, we focus only on classification trees, where the target variable is categorical. The splits in these trees are based on the homogeneity of the groups formed. The homogeneity or impurity of the data is quantified by metrics such as Entropy, Information Gain and Gini Index.

The most commonly used metric is information gain. It quantifies how much information a feature variable provides about the class.
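The metrics named above can be sketched in a few lines of base R. This is only an illustration: the helper names (entropy, gini, information_gain) and the toy vectors are made up for this example and are not part of rpart.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                  # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Gini index of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Information gain: entropy of the parent minus the weighted entropy of the children
information_gain <- function(labels, by) {
  groups  <- split(labels, by)
  weights <- sapply(groups, length) / length(labels)
  entropy(labels) - sum(weights * sapply(groups, entropy))
}

# Toy data (made up): 4 patients with diabetes and 4 without
outcome      <- c(1, 1, 1, 1, 0, 0, 0, 0)
high_glucose <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)

parent_entropy <- entropy(outcome)   # 1 bit: the classes are perfectly balanced
parent_gini    <- gini(outcome)      # 0.5
gain           <- information_gain(outcome, high_glucose)
```

A split that separates the classes perfectly would have a gain equal to the parent entropy; the imperfect split above recovers only part of it.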

This recipe demonstrates the modelling of a classification tree using a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

STEP 1: Importing Necessary Libraries

# For data manipulation
library(tidyverse)

# For the decision tree algorithm
library(rpart)

# For plotting the decision tree
install.packages("rpart.plot")
library(rpart.plot)

# Install the readxl R package for reading Excel sheets
install.packages("readxl")
library("readxl")

STEP 2: Loading the Train and Test Dataset

The train and test datasets are loaded separately. The data is split between train and test in an 80/20 proportion.

Data Description: This dataset consists of several medical predictor variables (also known as the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

# calling the function read_excel from the readxl library
train = read_excel('R_256_df_train_regression.xlsx')
test = read_excel('R_256_df_test_regression.xlsx')

# gives the number of observations and variables involved with its brief description
glimpse(train)
Observations: 613
Variables: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...
# gives the number of observations and variables involved with its brief description
glimpse(test)
Observations: 155
Variables: 9
$ Pregnancies               6, 11, 3, 6, 2, 9, 0, 2, 2, 6, 0, 2, 4, 0,...
$ Glucose                   105, 138, 106, 117, 68, 112, 119, 112, 92,...
$ BloodPressure             80, 74, 72, 96, 62, 82, 0, 86, 76, 94, 70,...
$ SkinThickness             28, 26, 0, 0, 13, 24, 0, 42, 20, 0, 27, 0,...
$ Insulin                   0, 144, 0, 0, 15, 0, 0, 160, 0, 0, 115, 0,...
$ BMI                       32.5, 36.1, 25.8, 28.7, 20.1, 28.2, 32.4, ...
$ DiabetesPedigreeFunction  0.878, 0.557, 0.207, 0.157, 0.257, 1.282, ...
$ Age                       26, 50, 27, 30, 23, 50, 24, 28, 28, 45, 21...
$ Outcome                   0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
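The recipe loads files that were already split. As a rough sketch of how an 80/20 split could be produced from a single data frame, the base R code below samples row indices at random. The data frame full, its values and the seed are made up for illustration; only the total of 768 rows mirrors the train and test sizes shown above.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible

# Hypothetical stand-in for the complete dataset (768 rows)
full <- data.frame(
  Glucose = round(rnorm(768, mean = 120, sd = 30)),
  Outcome = rbinom(768, size = 1, prob = 0.35)
)

# Randomly pick 80% of the row indices for training
train_idx <- sample(nrow(full), size = floor(0.8 * nrow(full)))

train_demo <- full[train_idx, ]    # rows selected for training
test_demo  <- full[-train_idx, ]   # all remaining rows

dim(train_demo)
dim(test_demo)
```

Because the indices are sampled without replacement, no row ends up in both sets.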

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. Here the data is scaled (standardised) so that different attributes are comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

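Note: Scaling is an important pre-modelling step for many algorithms. Decision trees split on thresholds of individual variables, so rescaling a variable does not change which splits are available; the step is included here as general practice.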
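A quick way to see what scale() does is to check the column means and standard deviations of its result. The sketch below uses the first five Glucose and BMI values printed by glimpse(train) above; the object names are chosen for this example only.

```r
# First five Glucose and BMI values from the training data shown earlier
df <- data.frame(
  Glucose = c(148, 85, 183, 89, 137),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1)
)

scaled <- scale(df)   # returns a numeric matrix, one column per variable

col_means <- colMeans(scaled)      # zero, up to floating point error
col_sds   <- apply(scaled, 2, sd)  # one for every column

round(col_means, 10)
col_sds
```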

# scaling columns 2 to 6 (Glucose to BMI) of the train dataset
train_scaled = scale(train[2:6])

# using cbind() function to add a new column Outcome to the scaled independent values
train_scaled = data.frame(cbind(train_scaled, Outcome = train$Outcome))

train_scaled %>% head()
(Output: the first six rows of train_scaled, with the scaled columns Glucose, BloodPressure, SkinThickness, Insulin and BMI followed by the unscaled Outcome column.)
# scaling columns 2 to 6 (Glucose to BMI) of the test dataset
test_scaled = scale(test[2:6])

# using cbind() function to add a new column Outcome to the scaled independent values
test_scaled = data.frame(cbind(test_scaled, Outcome = test$Outcome))

test_scaled %>% head()
(Output: the first six rows of test_scaled, with the scaled columns Glucose, BloodPressure, SkinThickness, Insulin and BMI followed by the unscaled Outcome column.)
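The code above scales the train and test sets independently, so each uses its own means and standard deviations. A common alternative, sketched here as an assumption and not as the recipe's method, is to reuse the training statistics for the test set so that both are on the same scale. The small data frames hold the first few Glucose and BMI values from the glimpse() outputs; the object names are made up.

```r
# Small stand-ins for the train and test predictors
train_x <- data.frame(Glucose = c(148, 85, 183, 89, 137),
                      BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1))
test_x  <- data.frame(Glucose = c(105, 138, 106),
                      BMI     = c(32.5, 36.1, 25.8))

train_scaled_m <- scale(train_x)

# scale() stores the statistics it used as attributes of its result
train_center <- attr(train_scaled_m, "scaled:center")
train_sd     <- attr(train_scaled_m, "scaled:scale")

# Apply the training means and standard deviations to the test predictors
test_scaled_m <- scale(test_x, center = train_center, scale = train_sd)

test_scaled_m
```

With this approach a given raw value maps to the same scaled value in both sets, which matters for a tree because its split thresholds are learned on the training scale.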

STEP 4: Creation of Decision Tree Classifier model using training set

We use the rpart() function to fit the model.

Syntax: rpart(formula, data = , method = '')

Where:

  1. formula = Outcome ~ . where Outcome is the dependent variable and . represents all other variables in the data, used as independent variables
  2. data = train_scaled
  3. method = 'class' (to fit a classification tree)

# creation of an object 'model' using rpart function
model = rpart(Outcome~., data = train_scaled, method = 'class')
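For classification, rpart() uses the Gini index as its default splitting criterion; passing parms = list(split = "information") selects an entropy-based criterion instead. The sketch below shows both calls on made-up data in which the outcome depends on Glucose only. The data frame toy, the seed and the threshold of 130 are assumptions for illustration and have nothing to do with the recipe's files.

```r
library(rpart)

set.seed(1)  # arbitrary seed for reproducibility

# Made-up data: Outcome is 1 exactly when Glucose is above 130
n <- 200
toy <- data.frame(
  Glucose = runif(n, min = 60, max = 200),
  BMI     = runif(n, min = 18, max = 45)
)
toy$Outcome <- factor(ifelse(toy$Glucose > 130, 1, 0))

# Default criterion for method = "class" is the Gini index
fit_gini <- rpart(Outcome ~ ., data = toy, method = "class")

# Entropy-based (information gain) splitting
fit_info <- rpart(Outcome ~ ., data = toy, method = "class",
                  parms = list(split = "information"))

# Variable chosen for the root split of each tree
root_gini <- as.character(fit_gini$frame$var[1])
root_info <- as.character(fit_info$frame$var[1])
```

Storing Outcome as a factor makes the classification intent explicit, although method = "class" also accepts a 0/1 numeric column as in the recipe.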

We use the rpart.plot() function to plot the decision tree model.

rpart.plot(model)
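Besides the plot, a fitted tree can be inspected as text, which also makes the root and leaf nodes described earlier visible. The sketch below fits a tree on made-up data (the data frame toy and the seed are assumptions) and reads the node table stored in the frame component of the rpart object.

```r
library(rpart)

set.seed(1)  # arbitrary seed for reproducibility

# Made-up data: Outcome is 1 exactly when Glucose is above 130
toy <- data.frame(Glucose = runif(200, min = 60, max = 200),
                  BMI     = runif(200, min = 18, max = 45))
toy$Outcome <- factor(ifelse(toy$Glucose > 130, 1, 0))

fit <- rpart(Outcome ~ ., data = toy, method = "class")

print(fit)     # one line per node; terminal nodes are marked with *
printcp(fit)   # complexity parameter table

# fit$frame has one row per node; the first row is the root node
# and terminal nodes carry the label "<leaf>" in the var column
root_var <- as.character(fit$frame$var[1])
n_leaves <- sum(fit$frame$var == "<leaf>")
```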

STEP 5: Predict using Test Dataset

We use the predict() function to do this.

Syntax: predict(fitted_model, df, type = '')

where:

  1. fitted_model = model fitted by train dataset
  2. df = test dataset
  3. type = 'class' for classification

predict_test = predict(model, test_scaled, type = "class")
predict_test %>% head()
1 0
2 1
3 0
4 0
5 0
6 1  
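With type = "class", predict() returns only the predicted label. For an rpart classification tree, type = "prob" returns the class probabilities instead, which is useful when a threshold other than 0.5 is wanted. The sketch below uses made-up data; toy, new_patients and the seed are assumptions for illustration.

```r
library(rpart)

set.seed(1)  # arbitrary seed for reproducibility

# Made-up data: Outcome is 1 exactly when Glucose is above 130
toy <- data.frame(Glucose = runif(200, min = 60, max = 200),
                  BMI     = runif(200, min = 18, max = 45))
toy$Outcome <- factor(ifelse(toy$Glucose > 130, 1, 0))

fit <- rpart(Outcome ~ ., data = toy, method = "class")

# Two hypothetical new observations, one on each side of the threshold
new_patients <- data.frame(Glucose = c(90, 170), BMI = c(25, 35))

pred_class <- predict(fit, new_patients, type = "class")  # factor of labels
pred_prob  <- predict(fit, new_patients, type = "prob")   # matrix, one column per class

pred_class
pred_prob
```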

STEP 6: Creation of confusion matrix

We use the table() function to create the confusion matrix between the actual and predicted values of the Outcome column.

confusion_matrix = table(test_scaled$Outcome, predict_test)
confusion_matrix
predict_test
     0  1
  0 84 16
  1 15 40
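The confusion matrix can be summarised with a few standard metrics. The sketch below re-enters the four counts shown above (rows are actual values, columns are predicted values) and computes accuracy, sensitivity and specificity in base R; the object names are chosen for this example.

```r
# Counts from the confusion matrix above: rows = actual, columns = predicted
confusion <- matrix(c(84, 15, 16, 40), nrow = 2,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("0", "1")))

# Share of all test observations that were classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of actual positives (Outcome = 1) that were predicted as 1
sensitivity <- confusion["1", "1"] / sum(confusion["1", ])

# Share of actual negatives (Outcome = 0) that were predicted as 0
specificity <- confusion["0", "0"] / sum(confusion["0", ])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

For these counts the accuracy is 0.8 (124 of 155 correct), the sensitivity is about 0.727 (40 of 55) and the specificity is 0.84 (84 of 100).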
