How to use RFE in R?

This recipe helps you use RFE in R

Recipe Objective

One of the major challenges in building a model is keeping it simple enough to interpret. Model complexity is often measured by the number of features (i.e. predictors). A large number of predictors increases the complexity of the model and can lead to multicollinearity and overfitting. It is therefore important to find an optimal number of features.

Recursive feature elimination (RFE) is a wrapper method of feature selection that uses a machine learning algorithm to find the best features. RFE performs a greedy search for the best-performing feature subset: it iteratively builds models, identifies the best- or worst-performing feature at each iteration, and constructs subsequent models with the remaining features until all features have been considered. It then ranks the features by the order of their elimination. This greedy strategy is far cheaper than an exhaustive search, which for a dataset with N features would have to evaluate 2^N feature combinations.
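To make the greedy elimination loop concrete, here is a minimal base-R sketch of backward elimination on the built-in mtcars data. It uses a linear model's coefficient p-values as a simple measure of feature weakness; both the dataset and the lm()-based scoring are illustrative choices, not part of the recipe below.

```r
# Greedy backward elimination on mtcars (mpg as the response).
features <- setdiff(names(mtcars), "mpg")
elimination_order <- character(0)

while (length(features) > 1) {
  fit <- lm(reformulate(features, response = "mpg"), data = mtcars)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
  worst <- names(which.max(pvals))                    # weakest feature this round
  elimination_order <- c(elimination_order, worst)
  features <- setdiff(features, worst)
}

# Features eliminated first rank lowest; the last survivor ranks highest
ranking <- rev(c(elimination_order, features))
print(ranking)
```

This mirrors what rfe() automates: fit, score, drop the weakest feature, repeat, then rank by elimination order.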

In this recipe, we focus on performing recursive feature elimination with a random forest algorithm, using the rfe() function from the caret package. This is carried out on a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.


STEP 1: Importing Necessary Libraries

install.packages('caret')
library(caret)     # for general data preparation and model fitting
library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

Data Description: This dataset consists of several medical predictor variables (the independent variables) and one target variable (Outcome).

Independent Variables:

  1. Pregnancies
  2. Glucose
  3. BloodPressure
  4. SkinThickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age

Dependent Variable:

Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

data <- read.csv("R_337_diabetes.csv")
glimpse(data)

Rows: 768
Columns: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

summary(data) # returns the statistical summary of the data columns

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

dim(data)

768   9

# Converting the dependent variable into factor levels
data$Outcome = as.factor(data$Outcome)

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training set (80%) and a testing set (20%)
parts = createDataPartition(data$Outcome, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
X_train = train[,-9]
y_train = train[,9]
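createDataPartition() stratifies the split on the target, so each class keeps roughly its original proportion in both sets. A base-R sketch of the same idea on the built-in iris data (the object names here are illustrative, not part of the recipe):

```r
# Stratified 80/20 split: sample 80% of the row indices within each class.
set.seed(50)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, size = floor(0.8 * length(i)))))
train_iris <- iris[idx, ]
test_iris  <- iris[-idx, ]
table(train_iris$Species)  # each of the three species contributes 40 of its 50 rows
```

Stratification matters here because the diabetes classes are imbalanced (about 35% positive), and a plain random split could skew that ratio.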

STEP 4: Performing recursive feature elimination

We will use the rfe() function from the caret package to implement recursive feature elimination.

Syntax: rfe(x, y, sizes = , rfeControl = )

where:

  1. x = a data frame or matrix of features
  2. y = the target variable
  3. sizes = the subset sizes of features to be evaluated
  4. rfeControl = a list of control options such as the algorithm, cross-validation settings, etc.

# specifying the CV technique and the random forest algorithm,
# which will be passed into the rfe() function
control_rfe = rfeControl(functions = rfFuncs,   # random forest
                         method = "repeatedcv", # repeated cv
                         repeats = 5,           # number of repeats
                         number = 10)           # number of folds

set.seed(50)

# Performing RFE
result_rfe = rfe(x = X_train,
                 y = y_train,
                 sizes = c(1:8),
                 rfeControl = control_rfe)

# summarising the results
result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6946 0.2827    0.05737 0.13473         
         2   0.7063 0.3356    0.05013 0.11179         
         3   0.7256 0.3811    0.04727 0.10913         
         4   0.7337 0.4012    0.05000 0.11852         
         5   0.7438 0.4255    0.04595 0.09999         
         6   0.7474 0.4261    0.04910 0.10992         
         7   0.7499 0.4349    0.04987 0.10736         
         8   0.7639 0.4606    0.04607 0.10471        *

The top 5 variables (out of 8):
   Glucose, BMI, Age, Pregnancies, Insulin

Note: Based on accuracy, the top 5 of the 8 features are Glucose, BMI, Age, Pregnancies and Insulin, although the best-performing subset (marked with *) uses all 8 features.

# all the features selected by rfe, in ranked order
predictors(result_rfe)

'Glucose' 'BMI' 'Age' 'Pregnancies' 'Insulin' 'DiabetesPedigreeFunction' 'SkinThickness' 'BloodPressure'

