How to use RFE in R?

This recipe helps you use RFE in R

Recipe Objective

One of the major challenges in building a model is keeping it simple enough to interpret. Model complexity is commonly measured by the number of features (i.e. predictors). A large number of predictors increases the complexity of the model, which can lead to multicollinearity and overfitting. Thus, it is important to find an optimum number of features.

Recursive feature elimination (RFE) is a wrapper method of feature selection, meaning it uses a machine learning algorithm itself to find the best features. RFE performs a greedy search for the best-performing feature subset: it iteratively fits a model, determines the least important feature, eliminates it, and builds the next model with the remaining features until all subset sizes have been explored. It then ranks the features in reverse order of their elimination. Because the search is greedy, a dataset with N features requires fitting on the order of N models, rather than exhaustively evaluating all 2^N possible feature subsets.
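The greedy loop described above can be sketched in a few lines of base R. This is only a toy illustration on simulated data, using a linear model's absolute t-statistics as a stand-in for feature importance; it is not how caret's rfFuncs implementation works internally.

```r
# Toy sketch of the RFE loop: fit a model, drop the weakest
# feature, refit on the remaining features, and record the order.
set.seed(1)
df <- data.frame(matrix(rnorm(500), ncol = 5))
names(df) <- paste0("x", 1:5)
df$y <- 3 * df$x1 + 2 * df$x2 + rnorm(100)   # only x1 and x2 matter

remaining <- paste0("x", 1:5)
elimination_order <- character(0)

while (length(remaining) > 1) {
  fit <- lm(reformulate(remaining, response = "y"), data = df)
  # absolute t-statistics serve as a crude importance measure here
  tvals <- abs(summary(fit)$coefficients[remaining, "t value"])
  weakest <- names(which.min(tvals))
  elimination_order <- c(elimination_order, weakest)
  remaining <- setdiff(remaining, weakest)
}

# features eliminated first rank lowest; the survivor ranks highest
print(rev(c(elimination_order, remaining)))
```

The informative predictors (x1, x2 in this simulation) survive the longest, which is exactly the ranking behaviour rfe() exploits.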

In this recipe, we focus on performing recursive feature elimination with a random forest model using the rfe() function from the caret package. This is carried out on a well-known dataset from the National Institute of Diabetes and Digestive and Kidney Diseases.


STEP 1: Importing Necessary Libraries

# for general data preparation and model fitting
install.packages('caret')
library(caret)

# for data manipulation
library(tidyverse)

STEP 2: Read a csv file and explore the data

Data Description: This dataset consists of several medical predictor variables (the independent variables) and one target variable (Outcome).

Independent Variables: ​

  1. Pregnancies
  2. Glucose
  3. BloodPressure
  4. SkinThickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age

Dependent Variable: ​

Outcome (0 = does not have diabetes, 1 = has diabetes)

data <- read.csv("R_337_diabetes.csv")
glimpse(data)

Rows: 768
Columns: 9
$ Pregnancies               6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
$ Glucose                   148, 85, 183, 89, 137, 116, 78, 115, 197, ...
$ BloodPressure             72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
$ SkinThickness             35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
$ Insulin                   0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
$ BMI                       33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
$ DiabetesPedigreeFunction  0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
$ Age                       50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
$ Outcome                   1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...

summary(data) # returns the statistical summary of the data columns

Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

dim(data)

768   9

# Converting the dependent variable into factor levels
data$Outcome = as.factor(data$Outcome)

STEP 3: Train Test Split

# createDataPartition() from the caret package splits the original dataset
# into a training (80%) and a testing (20%) set
parts = createDataPartition(data$Outcome, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]

# separating the predictors from the target in the training set
X_train = train[,-9]
y_train = train[,9]

STEP 4: Performing recursive feature elimination

We will use the rfe() function from the caret package to implement recursive feature elimination.

Syntax: rfe(x, y, sizes = , rfeControl = )

where:

  1. x = a data frame or matrix of features
  2. y = the target variable
  3. sizes = the feature subset sizes that should be tested
  4. rfeControl = a list of control options such as the model-fitting functions, the resampling method etc.

# specifying the CV technique as well as the random forest functions
# which will be passed into the rfe() function for feature selection
control_rfe = rfeControl(functions = rfFuncs,   # random forest
                         method = "repeatedcv", # repeated cv
                         repeats = 5,           # number of repeats
                         number = 10)           # number of folds

set.seed(50)

# Performing RFE
result_rfe = rfe(x = X_train,
                 y = y_train,
                 sizes = c(1:8),
                 rfeControl = control_rfe)

# summarising the results
result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6946 0.2827    0.05737 0.13473         
         2   0.7063 0.3356    0.05013 0.11179         
         3   0.7256 0.3811    0.04727 0.10913         
         4   0.7337 0.4012    0.05000 0.11852         
         5   0.7438 0.4255    0.04595 0.09999         
         6   0.7474 0.4261    0.04910 0.10992         
         7   0.7499 0.4349    0.04987 0.10736         
         8   0.7639 0.4606    0.04607 0.10471        *

The top 5 variables (out of 8):
   Glucose, BMI, Age, Pregnancies, Insulin

Note: based on accuracy, the top 5 of the 8 selected features are Glucose, BMI, Age, Pregnancies and Insulin.

# all the features selected by rfe
predictors(result_rfe)

'Glucose' 'BMI' 'Age' 'Pregnancies' 'Insulin' 'DiabetesPedigreeFunction' 'SkinThickness' 'BloodPressure'
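As a quick sanity check (not part of the original recipe), the model fitted by rfe() on the selected features can be scored against the 20% test split from STEP 3. This assumes the result_rfe and test objects created in the earlier steps are still in the workspace.

```r
# predict() on an rfe object applies the final random forest model
# built on the selected feature subset to new data
preds <- predict(result_rfe, newdata = test[, -9])

# with rfFuncs, predictions may come back as a data frame holding the
# class predictions (and probabilities), so pull out the pred column
if (is.data.frame(preds)) preds <- preds$pred

# postResample() reports Accuracy and Kappa on the held-out set
postResample(pred = preds, obs = test[, 9])
```

Held-out performance close to the cross-validated accuracy above would indicate that the feature selection has not overfit the training split.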
