How to perform Random Forest in R

This recipe helps you perform Random Forest in R

Recipe Objective

How to perform Random Forest in R.

Supervised learning is a type of machine learning where the model is trained on a data set for which the correct output is already known, under the assumption that there is a relationship between the input and the output. There are two types of supervised learning:

Regression : Linear Regression is a supervised learning algorithm used for continuous variables. Simple Linear Regression describes the relation between two variables, an independent variable (x) and a dependent variable (y).

Classification : Logistic Regression is a classification-type supervised learning model. It is used when the independent variable (x) can be continuous or categorical, but the dependent variable (y) is categorical.

Decision trees : The random forest algorithm is built from decision trees, which can be regression or classification trees. A decision tree builds a model in the form of a tree structure: it splits the dataset repeatedly and makes a decision at every node.

Now, what is a random forest and why do we need it? Random forest is a supervised learning algorithm that grows multiple decision trees and combines their results into one. It is an ensemble technique built from multiple decision models; ensemble techniques combine several machine learning models to obtain better predictive performance. While growing the trees, random forest adds extra randomness to the model: at each node it searches for the best split among a random subset of the features. This diversity generally results in a better model. This recipe demonstrates an example of performing Random Forest in R.
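As a toy illustration of the ensemble idea (separate from the recipe's model), the sketch below grows a handful of trees on bootstrap samples of R's built-in iris data using rpart and majority-votes their predictions. Note this sketch only does bagging; randomForest additionally samples a random subset of features at every split.

```r
library(rpart)  # decision trees; rpart ships with standard R distributions

set.seed(42)
n_trees <- 25

# Grow each tree on a bootstrap sample (sampling rows with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})

# Each tree casts a vote; the majority class wins for each observation
votes    <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(majority == iris$Species)  # ensemble accuracy on the training data
```

The accuracy here is optimistic because the ensemble is evaluated on the same data it was trained on; randomForest avoids this by reporting out-of-bag (OOB) error instead.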

Step 1 - Install required packages

install.packages("dplyr")                       # for data manipulation
library(dplyr)
install.packages("caTools")                     # sample.split() for the train/test split
library(caTools)
install.packages("randomForest")                # for generating the random forest model
library(randomForest)
install.packages("caret")                       # classification and regression training; provides confusionMatrix()
library(caret)
install.packages("e1071", dependencies = TRUE)  # required by caret

Step 2 - Read the dataset

A dataset on heart disease is taken (a classification problem), where predictions are to be made as to whether a patient has heart disease or not. The target variable is 'target'. Class 0 : the patient does not have heart disease. Class 1 : the patient has heart disease.

Dataset Description

    • age: age in years
    • sex: sex (1 = male; 0 = female)
    • cp: chest pain type
      • Value 1: typical angina
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: asymptomatic
    • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    • chol: serum cholestoral in mg/dl
    • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    • restecg: resting electrocardiographic results
      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    • thalach: maximum heart rate achieved
    • exang: exercise induced angina (1 = yes; 0 = no)
    • oldpeak : ST depression induced by exercise relative to rest
    • slope: the slope of the peak exercise ST segment
    • ca: number of major vessels (0-3) colored by flourosopy
    • thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
    • target: diagnosis of heart disease (angiographic disease status)
      • Value 0: < 50% diameter narrowing
      • Value 1: > 50% diameter narrowing

data = read.csv("http://storage.googleapis.com/dimensionless/ML_with_Python/Chapter%205/heart.csv")
print(head(data))
dim(data)      # returns the number of rows and columns in the dataset
summary(data)  # generates the statistical summary of the data

Step 3 - Split the data into train and test data sets

The training data is used for building the model, while the testing data is used for making predictions. After fitting a model on the training data set and minimizing its errors, the model is used to make predictions on unseen data, i.e. the test data.

split <- sample.split(data$target, SplitRatio = 0.8)  # split on the target vector so the classes stay balanced
split

sample.split() returns a logical vector assigning each row to the train or test set with a ratio of 0.8. This means 80% of our dataset goes into the training dataset and 20% into the testing dataset.

data_train <- subset(data, split == TRUE)
data_test <- subset(data, split == FALSE)

The train dataset gets all the rows for which split is TRUE, and similarly the test dataset gets all the rows for which it is FALSE.

dim(data_train)   # dimension/shape of train dataset
head(data_train)
dim(data_test)    # dimension/shape of test dataset
head(data_test)

Step 4 - Convert target variable to a factor form

Since our target variable is a yes/no type variable while the rest are numeric, we convert the target variable to a factor in order to maintain consistency.

data$target <- as.factor(data$target)
data_train$target <- as.factor(data_train$target)
data_test$target <- as.factor(data_test$target)  # keep the factor levels consistent across splits

Step 5 - Find the optimal value of 'mtry' (variables tried at each split)

tuneRF() searches for the value of mtry (the number of variables randomly sampled as split candidates at each node) with the lowest out-of-bag (OOB) prediction error; here the best value returned is 3.

bestmtry <- tuneRF(subset(data_train, select = -target), data_train$target, stepFactor = 1.2, improve = 0.01, trace = TRUE, plot = TRUE)  # x holds the predictors only, excluding the target

Step 6 - Create a Random forest model

model <- randomForest(target ~ ., data = data_train)
model

The model summary shows that the type of random forest is classification, that 500 trees were grown, and that 3 variables were tried at each split. The (out-of-bag) confusion matrix shows:

    • TP - 115 patients were correctly identified as having heart disease
    • TN - 81 patients were correctly identified as not having heart disease
    • FP - 27 patients were falsely identified as having heart disease when in fact they did not
    • FN - 14 patients were falsely identified as not having heart disease when in fact they did
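From these counts the standard classification metrics follow directly; a quick sketch using the figures quoted above:

```r
# Out-of-bag confusion matrix counts quoted in the model summary
TP <- 115; TN <- 81; FP <- 27; FN <- 14

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 196 / 237, roughly 0.827
sensitivity <- TP / (TP + FN)                   # true positive rate
specificity <- TN / (TN + FP)                   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

This OOB accuracy (about 82.7%) is the random forest's built-in estimate of generalization error, computed from the trees that did not see each observation during training.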

importance(model)   # returns the importance of the variables: most significant is cp, followed by thalach, and so on
varImpPlot(model)   # visualizes the variable importance of the model

Step 7 - Make predictions on test data

After the model is created and fitted, it is used for making predictions on unseen data values, i.e. the test dataset.

pred_test <- predict(model, newdata = data_test, type = "class")
pred_test
confusionMatrix(table(pred_test, data_test$target))  # compute the confusion matrix and see the accuracy score

The confusion matrix gives a clear picture of the model's performance on the test data. The accuracy is 83.33%, which is fairly good.

