How to use RFE in R?

This recipe helps you use RFE in R

Recipe Objective

One of the major challenges in building a model is keeping it simple enough to interpret. Model complexity is often measured by the number of features (i.e. predictors). A large number of predictors increases the complexity of the model and can lead to multicollinearity and overfitting. It is therefore important to find an optimal number of features.

Recursive feature elimination (RFE) is a wrapper method of feature selection that uses a machine learning algorithm to find the best features. RFE performs a greedy search for the best-performing feature subset: it iteratively builds models, identifies the best- or worst-performing feature at each iteration, and constructs subsequent models with the remaining features until all features have been considered. It then ranks the features by the order of their elimination. This greedy strategy is far cheaper than an exhaustive search, which for a dataset with N features would have to evaluate 2^N feature combinations.
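To make the greedy elimination loop concrete, here is a minimal base-R sketch of backward elimination on the built-in mtcars data. It uses a linear model's coefficient p-values as a simple measure of feature weakness; both the dataset and the lm()-based scoring are illustrative choices, not part of the recipe below.

```r
# Greedy backward elimination on mtcars (mpg as the response).
features <- setdiff(names(mtcars), "mpg")
elimination_order <- character(0)

while (length(features) > 1) {
  fit <- lm(reformulate(features, response = "mpg"), data = mtcars)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
  worst <- names(which.max(pvals))                    # weakest feature this round
  elimination_order <- c(elimination_order, worst)
  features <- setdiff(features, worst)
}

# Features eliminated first rank lowest; the last survivor ranks highest
ranking <- rev(c(elimination_order, features))
print(ranking)
```

This mirrors what rfe() automates: fit, score, drop the weakest feature, repeat, then rank by elimination order.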

In this recipe, we focus on performing recursive feature elimination with a random forest algorithm, using the rfe() function from the caret package. This is carried out on a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.


STEP 1: Importing Necessary Libraries

install.packages('caret')
library(caret)     # for general data preparation and model fitting
library(tidyverse) # for data manipulation

STEP 2: Read a csv file and explore the data

Data Description: This dataset consists of several medical predictor variables (the independent variables) and one target variable (Outcome).

Independent Variables:

  1. Pregnancies
  2. Glucose
  3. BloodPressure
  4. SkinThickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age

Dependent Variable:

Outcome (0 = 'does not have diabetes', 1 = 'has diabetes')

data <- read.csv("R_337_diabetes.csv")
glimpse(data)

Rows: 768
Columns: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

summary(data) # returns the statistical summary of the data columns

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

dim(data)

768   9

# Converting the dependent variable into factor levels
data$Outcome = as.factor(data$Outcome)

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training set (80%) and a testing set (20%)
parts = createDataPartition(data$Outcome, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
X_train = train[,-9]
y_train = train[,9]
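createDataPartition() stratifies the split on the target, so each class keeps roughly its original proportion in both sets. A base-R sketch of the same idea on the built-in iris data (the object names here are illustrative, not part of the recipe):

```r
# Stratified 80/20 split: sample 80% of the row indices within each class.
set.seed(50)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, size = floor(0.8 * length(i)))))
train_iris <- iris[idx, ]
test_iris  <- iris[-idx, ]
table(train_iris$Species)  # each of the three species contributes 40 of its 50 rows
```

Stratification matters here because the diabetes classes are imbalanced (about 35% positive), and a plain random split could skew that ratio.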

STEP 4: Performing recursive feature elimination

We will use the rfe() function from the caret package to implement recursive feature elimination.

Syntax: rfe(x, y, sizes = , rfeControl = )

where:

  1. x = a data frame or matrix of features
  2. y = the target variable
  3. sizes = the subset sizes of features to be evaluated
  4. rfeControl = a list of control options such as the algorithm, cross-validation settings, etc.

# specifying the CV technique and the random forest algorithm,
# which will be passed into the rfe() function
control_rfe = rfeControl(functions = rfFuncs,   # random forest
                         method = "repeatedcv", # repeated cv
                         repeats = 5,           # number of repeats
                         number = 10)           # number of folds

set.seed(50)

# Performing RFE
result_rfe = rfe(x = X_train,
                 y = y_train,
                 sizes = c(1:8),
                 rfeControl = control_rfe)

# summarising the results
result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6946 0.2827    0.05737 0.13473         
         2   0.7063 0.3356    0.05013 0.11179         
         3   0.7256 0.3811    0.04727 0.10913         
         4   0.7337 0.4012    0.05000 0.11852         
         5   0.7438 0.4255    0.04595 0.09999         
         6   0.7474 0.4261    0.04910 0.10992         
         7   0.7499 0.4349    0.04987 0.10736         
         8   0.7639 0.4606    0.04607 0.10471        *

The top 5 variables (out of 8):
   Glucose, BMI, Age, Pregnancies, Insulin

Note: Based on accuracy, the top 5 of the 8 features are Glucose, BMI, Age, Pregnancies and Insulin, although the best-performing subset (marked with *) uses all 8 features.

# all the features selected by rfe, in ranked order
predictors(result_rfe)

'Glucose' 'BMI' 'Age' 'Pregnancies' 'Insulin' 'DiabetesPedigreeFunction' 'SkinThickness' 'BloodPressure'

