# How to build classification trees in R?

This recipe helps you build classification trees in R.

A Decision Tree is a supervised machine learning algorithm that can be used to perform both classification and regression on complex datasets. Decision trees are also known as Classification and Regression Trees (CART), and they work with both continuous and categorical variables.

Important basic tree Terminology is as follows:

- Root node: represents the entire population or dataset, which gets divided into two or more homogeneous sets. Each split is based on a single input variable (x).
- Leaf or terminal node: these nodes do not split any further and contain the value of the output variable.

In this recipe, we focus only on Classification Trees, where the target variable is categorical. The splits in these trees are based on the homogeneity of the groups formed; the homogeneity (or impurity) of the data is quantified with metrics such as Entropy, Information Gain and the Gini Index.

The most commonly used metric is Information Gain, which quantifies how much information a feature variable provides about the class.
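As a small illustration (not part of the original recipe), these impurity measures can be computed by hand in R from the class proportions in a node; the proportions below are made up for the example:

```
# assumed class proportions in a node, e.g. 60% of class 0 and 40% of class 1
p = c(0.6, 0.4)
# Gini index: 1 - sum(p^2); lower values indicate a purer node
gini = 1 - sum(p^2)
# Entropy: -sum(p * log2(p)); zero for a perfectly pure node
entropy = -sum(p * log2(p))
gini
entropy
```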

This recipe demonstrates how to fit a Classification Tree using a well-known diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.

```
# For data manipulation
library(tidyverse)
# For the decision tree algorithm
library(rpart)
# For plotting the decision tree
install.packages("rpart.plot")
library(rpart.plot)
# For reading Excel sheets
install.packages("readxl")
library(readxl)
```

Next we load the train and test datasets separately. Here the train and test sets are an 80/20 split of the data, respectively; if you only have a single dataset, such a split could be created as sketched below.
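A minimal sketch of how an 80/20 split could be produced from a single data frame (here `df` is a placeholder for the full, unsplit dataset):

```
# illustrative 80/20 train/test split from a single data frame 'df'
set.seed(123)                                   # for reproducibility
train_index = sample(nrow(df), size = floor(0.8 * nrow(df)))
train = df[train_index, ]                       # 80% of the rows
test  = df[-train_index, ]                      # remaining 20%
```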

Data Description: This dataset consists of several medical predictor variables (the independent variables) and one target variable (Outcome).

Independent Variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

Dependent Variable: Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

```
# calling the function read_excel from the readxl library
train = read_excel('R_256_df_train_regression.xlsx')
test = read_excel('R_256_df_test_regression.xlsx')
# gives the number of observations and variables involved with its brief description
glimpse(train)
```

Observations: 613
Variables: 9
$ Pregnancies              6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                  148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure            72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, ...
$ SkinThickness            35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                  0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                      33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                      50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, ...
$ Outcome                  1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

```
# gives the number of observations and variables involved with its brief description
glimpse(test)
```

Observations: 155
Variables: 9
$ Pregnancies              6, 11, 3, 6, 2, 9, 0, 2, 2, 6, 0, 2, 4, 0, ...
$ Glucose                  105, 138, 106, 117, 68, 112, 119, 112, 92, ...
$ BloodPressure            80, 74, 72, 96, 62, 82, 0, 86, 76, 94, 70, ...
$ SkinThickness            28, 26, 0, 0, 13, 24, 0, 42, 20, 0, 27, 0, ...
$ Insulin                  0, 144, 0, 0, 15, 0, 0, 160, 0, 0, 115, 0, ...
$ BMI                      32.5, 36.1, 25.8, 28.7, 20.1, 28.2, 32.4, ...
$ DiabetesPedigreeFunction 0.878, 0.557, 0.207, 0.157, 0.257, 1.282, ...
$ Age                      26, 50, 27, 30, 23, 50, 24, 28, 28, 45, 21, ...
$ Outcome                  0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
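Before modelling, it can also be useful to check how balanced the two Outcome classes are; a quick optional check with table() and prop.table():

```
# frequency and proportion of each Outcome class in the training data
table(train$Outcome)
prop.table(table(train$Outcome))
```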

This is a pre-modelling step. Here the data is scaled (standardised) so that attributes measured on different scales become comparable; standardised data has a mean of zero and a standard deviation of one. We do this using the scale() function.

Note: Scaling is treated as a mandatory pre-modelling step in this recipe, although tree-based models are generally insensitive to the scale of the features.

```
# scaling the independent variables (columns 1 to 8) in the train dataset
train_scaled = scale(train[, 1:8])
# using cbind() to add the Outcome column back to the scaled independent variables
train_scaled = data.frame(cbind(train_scaled, Outcome = train$Outcome))
train_scaled %>% head()
```

      Weight    Weight1      Length    Height       Width Outcome
 -0.33379271 -0.3132781 -0.08858827 0.4095324 -0.42466337     242
 -0.22300101 -0.1970948  0.04945726 0.6459374 -0.22972408     290
 -0.23684997 -0.1712763  0.03795346 0.6207701  0.03681581     340
  0.09552513  0.1514550  0.31404453 0.7075012 -0.12740825     363
  0.12322305  0.1514550  0.37156350 0.6370722  0.33570907     430
  0.16476994  0.2418198  0.45209006 0.9223343  0.19469206     450
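As an optional sanity check (not part of the original output), the scaled predictor columns should now have a mean of approximately zero and a standard deviation of approximately one:

```
# means of the scaled predictors should be close to 0
colMeans(train_scaled[, -ncol(train_scaled)])
# standard deviations of the scaled predictors should be close to 1
apply(train_scaled[, -ncol(train_scaled)], 2, sd)
```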

```
# scaling the independent variables (columns 1 to 8) in the test dataset
test_scaled = scale(test[, 1:8])
# using cbind() to add the Outcome column back to the scaled independent variables
test_scaled = data.frame(cbind(test_scaled, Outcome = test$Outcome))
test_scaled %>% head()
```

     Weight    Weight1     Length     Height       Width Outcome
 0.72483012 0.72445274 0.69959684 2.15715925  1.87080937    1000
 0.07204194 0.08459639 0.09077507 0.03471101 -0.06904068     200
 0.17201851 0.17756697 0.24556027 0.07758442  0.29059599     300
 0.23082825 0.23225555 0.29715533 0.14769072  0.39466263     300
 0.35432872 0.35803927 0.34875040 0.25564092  0.22707121     300
 0.39549554 0.39632128 0.38486694 0.56280832  0.48296300     430

We use the rpart() function to fit the model.

Syntax: rpart(formula, data = , method = '')

Where:

- Formula of the decision tree: Outcome ~ ., where Outcome is the dependent variable and . represents all the other independent variables
- data = train_scaled
- method = 'class' (to fit a binary classification model)

```
# creation of an object 'model' using rpart function
model = rpart(Outcome~., data = train_scaled, method = 'class')
```
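Optionally, before plotting, the fitted tree can be inspected in text form; print() shows the splits and printcp() (from rpart) shows the complexity-parameter table, which is useful if you later want to prune the tree:

```
# text representation of the fitted splits
print(model)
# complexity parameter table, useful for pruning decisions
printcp(model)
```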

We use the rpart.plot() function to plot the fitted decision tree.

```
rpart.plot(model)
```
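rpart.plot() also accepts arguments that control how the nodes are labelled; for example, a variation such as the following (using the type and extra arguments of the rpart.plot package) shows class probabilities and the percentage of observations in each node:

```
# type = 2 puts the split labels below the nodes,
# extra = 104 adds class probabilities and the percentage of observations per node
rpart.plot(model, type = 2, extra = 104)
```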

We use the predict() function to make predictions on the test dataset.

Syntax: predict(fitted_model, df, type = '')

where:

- fitted_model = the model fitted on the train dataset
- df = the test dataset
- type = 'class' for classification

```
predict_test = predict(model, test_scaled, type = "class")
predict_test %>% head()
```

1 2 3 4 5 6
0 1 0 0 0 1
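If class probabilities are needed rather than hard class labels, predict() can also be called with type = 'prob' (an optional variation on the step above):

```
# predicted probability of each class for the test observations
predict_prob = predict(model, test_scaled, type = "prob")
predict_prob %>% head()
```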

We use the table() function to create the confusion matrix between the actual and predicted values of the Outcome column.

```
confusion_matrix = table(test_scaled$Outcome, predict_test)
confusion_matrix
```

   predict_test
     0  1
  0 84 16
  1 15 40
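From this confusion matrix the overall accuracy can be computed directly; here it is (84 + 40) / 155 = 0.8:

```
# accuracy = correctly classified observations / total observations
accuracy = sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
```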
