How to use RFE in R?

This recipe helps you use RFE in R

Recipe Objective

One of the major challenges in building a model is keeping it simple enough to interpret. Model complexity is commonly measured by the number of features (i.e. predictors). A large number of predictors increases the complexity of the model, which can lead to multicollinearity and overfitting. Thus, it is important to find an optimum number of features.

Recursive feature elimination (RFE) is a wrapper method of feature selection, meaning it uses a machine learning algorithm itself to find the best features. RFE performs a greedy search for the best-performing feature subset: it iteratively fits a model, determines the least important feature, eliminates it, and builds the next model with the remaining features until all subset sizes have been explored. It then ranks the features in reverse order of their elimination. Because the search is greedy, a dataset with N features requires fitting on the order of N models, rather than exhaustively evaluating all 2^N possible feature subsets.
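The greedy loop described above can be sketched in a few lines of base R. This is only a toy illustration on simulated data, using a linear model's absolute t-statistics as a stand-in for feature importance; it is not how caret's rfFuncs implementation works internally.

```r
# Toy sketch of the RFE loop: fit a model, drop the weakest
# feature, refit on the remaining features, and record the order.
set.seed(1)
df <- data.frame(matrix(rnorm(500), ncol = 5))
names(df) <- paste0("x", 1:5)
df$y <- 3 * df$x1 + 2 * df$x2 + rnorm(100)   # only x1 and x2 matter

remaining <- paste0("x", 1:5)
elimination_order <- character(0)

while (length(remaining) > 1) {
  fit <- lm(reformulate(remaining, response = "y"), data = df)
  # absolute t-statistics serve as a crude importance measure here
  tvals <- abs(summary(fit)$coefficients[remaining, "t value"])
  weakest <- names(which.min(tvals))
  elimination_order <- c(elimination_order, weakest)
  remaining <- setdiff(remaining, weakest)
}

# features eliminated first rank lowest; the survivor ranks highest
print(rev(c(elimination_order, remaining)))
```

The informative predictors (x1, x2 in this simulation) survive the longest, which is exactly the ranking behaviour rfe() exploits.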

In this recipe, we focus on performing recursive feature elimination with a random forest model using the rfe() function from the caret package. This is carried out on a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.


STEP 1: Importing Necessary Libraries

# for general data preparation and model fitting
install.packages('caret')
library(caret)

# for data manipulation
library(tidyverse)

STEP 2: Read a csv file and explore the data

Data Description: This dataset consists of several medical predictor variables (the independent variables) and one target variable (Outcome).

Independent Variables: ​

  1. Pregnancies
  2. Glucose
  3. BloodPressure
  4. SkinThickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age

Dependent Variable: ​

Outcome (0 = does not have diabetes, 1 = has diabetes)

data <- read.csv("R_337_diabetes.csv")
glimpse(data)

Rows: 768
Columns: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

summary(data) # returns the statistical summary of the data columns

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

dim(data)

768   9

# Converting the dependent variable into factor levels
data$Outcome = as.factor(data$Outcome)

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training (80%) and a testing (20%) set
parts = createDataPartition(data$Outcome, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

# separating the predictors from the target in the training set
X_train = train[,-9]
y_train = train[,9]

STEP 4: Performing recursive feature elimination

We will use the rfe() function from the caret package to implement recursive feature elimination.

Syntax: rfe(x, y, sizes = , rfeControl = )

where:

  1. x = a data frame or matrix of features
  2. y = the target variable
  3. sizes = the feature subset sizes that should be tested
  4. rfeControl = a list of control options such as the model-fitting functions, the resampling method etc.

# specifying the CV technique as well as the random forest functions
# which will be passed into the rfe() function for feature selection
control_rfe = rfeControl(functions = rfFuncs,   # random forest
                         method = "repeatedcv", # repeated cv
                         repeats = 5,           # number of repeats
                         number = 10)           # number of folds

set.seed(50)

# Performing RFE
result_rfe = rfe(x = X_train,
                 y = y_train,
                 sizes = c(1:8),
                 rfeControl = control_rfe)

# summarising the results
result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6946 0.2827    0.05737 0.13473         
         2   0.7063 0.3356    0.05013 0.11179         
         3   0.7256 0.3811    0.04727 0.10913         
         4   0.7337 0.4012    0.05000 0.11852         
         5   0.7438 0.4255    0.04595 0.09999         
         6   0.7474 0.4261    0.04910 0.10992         
         7   0.7499 0.4349    0.04987 0.10736         
         8   0.7639 0.4606    0.04607 0.10471        *

The top 5 variables (out of 8):
   Glucose, BMI, Age, Pregnancies, Insulin

Note: based on accuracy, the top 5 of the 8 selected features are Glucose, BMI, Age, Pregnancies and Insulin.

# all the features selected by rfe
predictors(result_rfe)

'Glucose' 'BMI' 'Age' 'Pregnancies' 'Insulin' 'DiabetesPedigreeFunction' 'SkinThickness' 'BloodPressure'
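As a quick sanity check (not part of the original recipe), the model fitted by rfe() on the selected features can be scored against the 20% test split from STEP 3. This assumes the result_rfe and test objects created in the earlier steps are still in the workspace.

```r
# predict() on an rfe object applies the final random forest model
# built on the selected feature subset to new data
preds <- predict(result_rfe, newdata = test[, -9])

# with rfFuncs, predictions may come back as a data frame holding the
# class predictions (and probabilities), so pull out the pred column
if (is.data.frame(preds)) preds <- preds$pred

# postResample() reports Accuracy and Kappa on the held-out set
postResample(pred = preds, obs = test[, 9])
```

Held-out performance close to the cross-validated accuracy above would indicate that the feature selection has not overfit the training split.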
