How to find optimal parameters for Random Forest in R

This recipe helps you find optimal parameters for Random Forest in R

Recipe Objective

How to find optimal parameters for Random Forest in R?

Random forest is a supervised learning algorithm that grows multiple decision trees and combines their results into a single prediction. It is an ensemble technique: instead of relying on one model, it aggregates many decision trees to obtain better predictive performance. Random forest also adds randomness while growing the trees: at each split it searches for the best feature among a random subset of the features rather than among all of them, which generally results in a better model.

Hyperparameter tuning is the process of searching for the parameter values that give the best model. Tuning matters because hyperparameters directly affect how the model is trained, which in turn affects its performance on the testing dataset. Many tuning methods are available, such as manual search, grid search, random search and Bayesian optimization. In this example we use the tuneRF() function from the randomForest package to find the optimal parameter for our random forest. This recipe demonstrates how to find optimal parameters for Random Forest in R.
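Before turning to tuneRF(), it may help to see what a plain grid search looks like. The sketch below is only an illustration: it assumes the randomForest package is installed, uses the built-in iris data as a stand-in for the heart disease data, and the names mtry_grid, oob_error and results are made up for this example.

```r
# Minimal grid search over mtry, scored by out-of-bag (OOB) error.
# Assumption: randomForest is installed; iris is a stand-in dataset.
library(randomForest)

set.seed(42)  # make the forests reproducible

x <- iris[, names(iris) != "Species"]  # predictors only
y <- iris$Species                      # factor target, so this is classification

mtry_grid <- 1:ncol(x)  # candidate numbers of predictors tried at each split

oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(x = x, y = y, mtry = m, ntree = 200)
  # err.rate has one row per tree; the "OOB" column of the last row
  # is the OOB error of the full forest
  fit$err.rate[fit$ntree, "OOB"]
})

results <- data.frame(mtry = mtry_grid, oob_error = oob_error)
print(results)

# Candidate with the lowest OOB error
best_mtry <- results$mtry[which.min(results$oob_error)]
best_mtry
```

A grid search tries every candidate, which is affordable here because there are only four predictors. tuneRF() instead steps away from a starting value and stops when the OOB error no longer improves enough.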


Step 1 - Install required packages

    install.packages("dplyr") # dplyr for data manipulation
    library(dplyr)
    install.packages("caret") # classification and regression training
    library(caret)
    install.packages("e1071", dependencies = TRUE)
    install.packages("caTools") # provides sample.split(), used in Step 3
    library(caTools)
    install.packages("randomForest") # provides randomForest() and tuneRF(), used in Step 5
    library(randomForest)

Step 2 - Read the dataset

A dataset on heart disease is used (a classification problem), where predictions are to be made on whether a patient has heart disease or not. The target variable is y: 'target'. Class 0: patient does not have heart disease. Class 1: patient has heart disease.

Dataset Description

    • age: age in years
    • sex: sex (1 = male; 0 = female)
    • cp: chest pain type
      • Value 1: typical angina
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: asymptomatic
    • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    • chol: serum cholesterol in mg/dl
    • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    • restecg: resting electrocardiographic results
      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    • thalach: maximum heart rate achieved
    • exang: exercise induced angina (1 = yes; 0 = no)
    • oldpeak: ST depression induced by exercise relative to rest
    • slope: the slope of the peak exercise ST segment
    • ca: number of major vessels (0-3) colored by fluoroscopy
    • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    • target: diagnosis of heart disease (angiographic disease status)
      • Value 0: < 50% diameter narrowing
      • Value 1: > 50% diameter narrowing

    data <- read.csv("http://storage.googleapis.com/dimensionless/ML_with_Python/Chapter%205/heart.csv")
    print(head(data))
    dim(data) # returns the number of rows and columns in the dataset
    summary(data) # generates a statistical summary of the data

Step 3 - Split the data into train and test data sets

The training data is used for building the model, while the testing data is used for evaluating its predictions. That is, once the model has been fitted on the training set and its errors minimized, it is used to make predictions on unseen data, which is the test set.

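    library(caTools) # provides sample.split()
    set.seed(123) # make the split reproducible
    # sample.split() expects the label vector, not the whole data frame
    split <- sample.split(data$target, SplitRatio = 0.8)
    split
    data_train <- subset(data, split == TRUE)
    data_test <- subset(data, split == FALSE)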
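A split that keeps the class proportions the same in both sets (a stratified split) can also be written in base R. The sketch below is an illustration only: toy_data is a made-up data frame standing in for the heart disease data, and the names train_idx, toy_train and toy_test are hypothetical.

```r
# Base-R sketch of a stratified 80/20 split.
# Assumption: toy_data is a made-up stand-in for the heart disease data.
set.seed(123)

toy_data <- data.frame(
  age    = sample(30:75, 100, replace = TRUE),
  target = rep(c(0, 1), times = c(40, 60))
)

# Group the row numbers by class, then draw 80% of each group for training
rows_by_class <- split(seq_len(nrow(toy_data)), toy_data$target)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

toy_train <- toy_data[train_idx, ]
toy_test  <- toy_data[-train_idx, ]

# Class proportions should match between the two sets
prop.table(table(toy_train$target))
prop.table(table(toy_test$target))
```

Sampling within each class separately is what keeps the 40/60 class balance identical in the training and test sets.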

Step 4 - Convert target variable to a factor form

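Since our target variable is a yes/no variable stored as the numbers 0 and 1, we convert it to a factor so that the random forest treats the problem as classification rather than regression. The conversion is applied to the full data and to both the training and the test sets.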

    data$target <- as.factor(data$target)
    data_train$target <- as.factor(data_train$target)
    data_test$target <- as.factor(data_test$target)
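It is worth checking that the conversion worked, because randomForest() chooses between classification and regression from the type of the response. The sketch below uses a small made-up vector coded like the recipe's target column; the labels "no_disease" and "disease" are purely illustrative.

```r
# Assumption: 'target' is a small made-up vector coded 0/1 like the recipe's column.
target <- c(0, 1, 1, 0, 1)

is.factor(target)  # FALSE: a numeric response would be modelled as regression

# Convert to a factor; the labels are optional and only make output easier to read
target_factor <- factor(target, levels = c(0, 1),
                        labels = c("no_disease", "disease"))

is.factor(target_factor)  # TRUE
levels(target_factor)     # the two classes, in level order
table(target_factor)      # how many observations fall in each class
```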

Step 5 - Finding optimized parameters

We can use the tuneRF() function from the randomForest package to find the optimal parameter. By default, the randomForest() function grows 500 trees and, at each split, considers a randomly selected subset of the predictors as candidates. The size of that subset is the mtry parameter, which defaults to the square root of the number of predictors for classification. tuneRF() searches for the value of mtry that gives the lowest out-of-bag (OOB) error.

Syntax: tuneRF(x, y, stepFactor, improve, trace, plot)

where,
    • x: the predictor columns of the training data
    • y: the target (dependent) variable
    • stepFactor: the factor by which mtry is inflated or deflated at each iteration
    • improve: the relative improvement in OOB error required for the search to continue
    • trace: whether to print the progress of the search
    • plot: whether to plot the OOB error as a function of mtry
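To get a feel for stepFactor, the arithmetic of one search step can be sketched in base R. This is an assumption-laden illustration, not the package's code: next_mtry is a hypothetical helper, and it assumes that the number of predictors tried at each split (mtry in randomForest) is rounded up when shrinking, rounded down when growing, and kept between 1 and the number of predictors.

```r
# Sketch of how the candidate mtry value moves under a given stepFactor.
# Assumption: shrinking rounds up, growing rounds down, and the result is
# clamped to [1, p]. This is a reading of the search, not tuneRF() itself.
next_mtry <- function(mtry, step_factor, p, direction = c("left", "right")) {
  direction <- match.arg(direction)
  if (direction == "left") {
    max(1, ceiling(mtry / step_factor))  # step towards fewer predictors
  } else {
    min(p, floor(mtry * step_factor))    # step towards more predictors
  }
}

p <- 13                       # number of predictors in the heart disease data
mtry_start <- floor(sqrt(p))  # usual starting point for classification

next_mtry(mtry_start, 1.2, p, "left")
next_mtry(mtry_start, 1.2, p, "right")
next_mtry(mtry_start, 2, p, "left")
next_mtry(mtry_start, 2, p, "right")
```

Under these assumptions a stepFactor of 1.2 starting from 3 rounds back to 3 in both directions, so no new value would be tried, whereas a stepFactor of 2 moves to 2 and 6. If the search seems to stop immediately, a larger stepFactor is worth trying.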

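    library(randomForest) # provides tuneRF()
    # Pass only the predictor columns as x; including the target there leaks the label
    bestmtry <- tuneRF(x = data_train[, names(data_train) != "target"], y = data_train$target, stepFactor = 1.2, improve = 0.01, trace = TRUE, plot = TRUE)

tuneRF() returns the mtry values it tried together with their OOB errors (OOB error is the out-of-bag prediction error); the best mtry is the one with the lowest OOB error. The original run of this recipe reported a best mtry of 3 with an OOB error of 0%, but that run passed the whole of data_train, target column included, as the predictors. The 0% figure therefore reflects label leakage rather than genuine accuracy, and the reported OOB error will differ once the target is excluded.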
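Once a value has been chosen, it can be passed to randomForest() and the fitted model checked on the held-out data. The following is a minimal sketch, assuming the randomForest package is installed and using the built-in iris data as a stand-in for the heart disease data; the names tuned, best_mtry and final_model are made up for this example.

```r
# Assumptions: randomForest is installed; iris stands in for the heart data.
library(randomForest)

set.seed(7)

# Simple random 80/20 split of the 150 rows
train_rows <- sample(seq_len(nrow(iris)), size = 120)
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

x_train <- train[, names(train) != "Species"]  # predictors only

# tuneRF() returns a matrix: first column the mtry values tried,
# second column the corresponding OOB errors
tuned <- tuneRF(x = x_train, y = train$Species,
                stepFactor = 2, improve = 0.01,
                trace = FALSE, plot = FALSE)

# Pick the mtry with the lowest OOB error
best_mtry <- unname(tuned[which.min(tuned[, 2]), 1])

# Refit with the chosen mtry and the default 500 trees
final_model <- randomForest(x = x_train, y = train$Species,
                            mtry = best_mtry, ntree = 500)

# Evaluate on the held-out rows
pred <- predict(final_model, newdata = test[, names(x_train)])
table(predicted = pred, actual = test$Species)
mean(pred == test$Species)  # test-set accuracy
```

The OOB error guides the choice of mtry, while the test set gives an independent check on the final model.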


