How to perform Random Forest in R

This recipe helps you perform Random Forest in R

Recipe Objective

How to perform Random Forest in R.

Supervised learning is a type of machine learning where the model is trained on a data set for which the correct output is already known, under the assumption that there is a relationship between the input and the output. There are two types of supervised learning:

Regression : Linear Regression is a supervised learning algorithm used for continuous variables. Simple Linear Regression describes the relation between two variables, an independent variable (x) and a dependent variable (y).

Classification : Logistic Regression is a classification-type supervised learning model. It is used when the independent variable (x) can be continuous or categorical, but the dependent variable (y) is categorical.

Decision trees : The random forest algorithm is built from decision trees, which can be regression or classification trees. A decision tree builds a model in the form of a tree structure: it splits the dataset repeatedly and makes a decision at every node.

Now, what is a random forest and why do we need it? Random forest is a supervised learning algorithm that grows multiple decision trees and combines their results into one. It is an ensemble technique built from multiple decision models; ensemble techniques combine several machine learning models to obtain better predictive performance. While growing the trees, random forest adds extra randomness to the model: at each node it searches for the best split among a random subset of the features. This diversity generally results in a better model. This recipe demonstrates an example of performing Random Forest in R.
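As a toy illustration of the ensemble idea (separate from the recipe's model), the sketch below grows a handful of trees on bootstrap samples of R's built-in iris data using rpart and majority-votes their predictions. Note this sketch only does bagging; randomForest additionally samples a random subset of features at every split.

```r
library(rpart)  # decision trees; rpart ships with standard R distributions

set.seed(42)
n_trees <- 25

# Grow each tree on a bootstrap sample (sampling rows with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})

# Each tree casts a vote; the majority class wins for each observation
votes    <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(majority == iris$Species)  # ensemble accuracy on the training data
```

The accuracy here is optimistic because the ensemble is evaluated on the same data it was trained on; randomForest avoids this by reporting out-of-bag (OOB) error instead.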

Step 1 - Install required packages

install.packages("dplyr")                       # for data manipulation
library(dplyr)
install.packages("caTools")                     # sample.split() for the train/test split
library(caTools)
install.packages("randomForest")                # for generating the random forest model
library(randomForest)
install.packages("caret")                       # classification and regression training; provides confusionMatrix()
library(caret)
install.packages("e1071", dependencies = TRUE)  # required by caret

Step 2 - Read the dataset

A dataset on heart disease is taken (a classification problem), where predictions are to be made as to whether a patient has heart disease or not. The target variable is 'target'. Class 0 : the patient does not have heart disease. Class 1 : the patient has heart disease.

Dataset Description

    • age: age in years
    • sex: sex (1 = male; 0 = female)
    • cp: chest pain type
      • Value 1: typical angina
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: asymptomatic
    • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    • chol: serum cholestoral in mg/dl
    • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    • restecg: resting electrocardiographic results
      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    • thalach: maximum heart rate achieved
    • exang: exercise induced angina (1 = yes; 0 = no)
    • oldpeak : ST depression induced by exercise relative to rest
    • slope: the slope of the peak exercise ST segment
    • ca: number of major vessels (0-3) colored by flourosopy
    • thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
    • target: diagnosis of heart disease (angiographic disease status)
      • Value 0: < 50% diameter narrowing
      • Value 1: > 50% diameter narrowing

data = read.csv("http://storage.googleapis.com/dimensionless/ML_with_Python/Chapter%205/heart.csv")
print(head(data))
dim(data)      # returns the number of rows and columns in the dataset
summary(data)  # generates the statistical summary of the data

Step 3 - Split the data into train and test data sets

The training data is used for building the model, while the testing data is used for making predictions. After fitting a model on the training data set and minimizing its errors, the model is used to make predictions on unseen data, i.e. the test data.

split <- sample.split(data$target, SplitRatio = 0.8)  # split on the target vector so the classes stay balanced
split

sample.split() returns a logical vector assigning each row to the train or test set with a ratio of 0.8. This means 80% of our dataset goes into the training dataset and 20% into the testing dataset.

data_train <- subset(data, split == TRUE)
data_test <- subset(data, split == FALSE)

The train dataset gets all the rows for which split is TRUE, and similarly the test dataset gets all the rows for which it is FALSE.

dim(data_train)   # dimension/shape of train dataset
head(data_train)
dim(data_test)    # dimension/shape of test dataset
head(data_test)

Step 4 - Convert target variable to a factor form

Since our target variable is a yes/no type variable while the rest are numeric, we convert the target variable to a factor in order to maintain consistency.

data$target <- as.factor(data$target)
data_train$target <- as.factor(data_train$target)
data_test$target <- as.factor(data_test$target)  # keep the factor levels consistent across splits

Step 5 - Find the optimal value of 'mtry' (variables tried at each split)

tuneRF() searches for the value of mtry (the number of variables randomly sampled as split candidates at each node) with the lowest out-of-bag (OOB) prediction error; here the best value returned is 3.

bestmtry <- tuneRF(subset(data_train, select = -target), data_train$target, stepFactor = 1.2, improve = 0.01, trace = TRUE, plot = TRUE)  # x holds the predictors only, excluding the target

Step 6 - Create a Random forest model

model <- randomForest(target ~ ., data = data_train)
model

The model summary shows that the type of random forest is classification, that 500 trees were grown, and that 3 variables were tried at each split. The (out-of-bag) confusion matrix shows:

    • TP - 115 patients were correctly identified as having heart disease
    • TN - 81 patients were correctly identified as not having heart disease
    • FP - 27 patients were falsely identified as having heart disease when in fact they did not
    • FN - 14 patients were falsely identified as not having heart disease when in fact they did
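From these counts the standard classification metrics follow directly; a quick sketch using the figures quoted above:

```r
# Out-of-bag confusion matrix counts quoted in the model summary
TP <- 115; TN <- 81; FP <- 27; FN <- 14

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 196 / 237, roughly 0.827
sensitivity <- TP / (TP + FN)                   # true positive rate
specificity <- TN / (TN + FP)                   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

This OOB accuracy (about 82.7%) is the random forest's built-in estimate of generalization error, computed from the trees that did not see each observation during training.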

importance(model)   # returns the importance of the variables: most significant is cp, followed by thalach, and so on
varImpPlot(model)   # visualizes the variable importance of the model

Step 7 - Make predictions on test data

After the model is created and fitted, it is used for making predictions on unseen data values, i.e. the test dataset.

pred_test <- predict(model, newdata = data_test, type = "class")
pred_test
confusionMatrix(table(pred_test, data_test$target))  # compute the confusion matrix and see the accuracy score

The confusion matrix gives a clear picture of the model's performance on the test data. The accuracy is 83.33%, which is fairly good.

