How to implement XGBoost in R

In this recipe, we shall learn how to use the XGBoost classifier in R.

Recipe Objective: How to implement XGBoost in R?

XGBoost is an ensemble machine learning algorithm that uses gradient boosting. It is designed to optimize both model performance and execution speed, and it can be used for both regression and classification problems. The steps to implement XGBoost in R are as follows:

Step 1: Load the required packages

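#importing required packages
library(modeldata)    #stackoverflow dataset
library(dplyr)        #select() and the %>% pipe
library(caTools)      #sample.split() for the train-test split
library(fastDummies)  #dummy_cols()
library(xgboost)      #model training and SHAP plots
library(caret)        #confusionMatrix()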

Step 2: Load the dataset

#loading the dataset
data("stackoverflow")
data <- stackoverflow

The stackoverflow dataset comes from the annual Stack Overflow Developer Survey. It contains 5,594 observations on developers and can be used to predict who works remotely.

Step 3: Train-Test split

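#train-test split
library(caTools)  #provides sample.split()
set.seed(101)
#sample.split() keeps the ratio of the Remote classes the same in both parts
is_train <- sample.split(data$Remote, SplitRatio = 0.75)
train <- subset(data, is_train == TRUE)
test <- subset(data, is_train == FALSE)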
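
To see what a class-preserving split does, here is a minimal base-R sketch. It is not the caTools implementation: the helper name stratified_split and the toy labels are assumptions made for illustration. It samples the requested fraction within each class, so both parts keep the same class balance.

```r
# Hypothetical helper (not part of caTools): returns a logical vector that is
# TRUE for rows assigned to the training set, sampling within each class.
stratified_split <- function(y, ratio = 0.75) {
  in_train <- logical(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # sample positions, not values, so a one-row class is handled safely
    picked <- idx[sample.int(length(idx), n_train)]
    in_train[picked] <- TRUE
  }
  in_train
}

set.seed(101)
# toy labels (made up): 10% "Remote", 90% "Not remote"
y <- factor(c(rep("Remote", 40), rep("Not remote", 360)))
is_train <- stratified_split(y, ratio = 0.75)

# both parts keep the 10% / 90% class balance
prop.table(table(y[is_train]))
prop.table(table(y[!is_train]))
```

With an imbalanced target such as Remote, a plain random split could leave the test set with noticeably more or fewer minority-class rows than the training set, which is why a stratified split is a reasonable default here.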

Step 4: Isolate X and y

#isolating y
#as.numeric() on a factor returns the level codes 1 and 2; subtract 1 to get the 0/1 labels that binary:logistic expects
train_y <- as.numeric(train$Remote) - 1
test_y <- as.numeric(test$Remote) - 1
#isolating X
train_X <- train %>% select(-Remote)
test_X <- test %>% select(-Remote)
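
The "- 1" matters because as.numeric() on a factor returns the internal level codes, which start at 1, while a binary logistic objective needs labels of 0 and 1. The short sketch below illustrates this on a toy factor; the values and the level order are assumptions chosen for illustration, not taken from the dataset.

```r
# toy factor; the level order here is an assumption for illustration
remote <- factor(c("Remote", "Not remote", "Not remote", "Remote"),
                 levels = c("Remote", "Not remote"))

as.numeric(remote)       # level codes follow the level order: 1 2 2 1
as.numeric(remote) - 1   # shifted to 0/1: 0 1 1 0

# an explicit comparison makes the coding independent of the level order
y <- as.numeric(remote == "Remote")   # 1 for "Remote", 0 otherwise
```

Which class ends up as 0 and which as 1 depends on the order of the factor levels, so it is worth checking levels(train$Remote) before interpreting the predictions.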

Step 5: Check the structure of X

#checking structure of X
str(train_X)

	tibble [4,195 x 20] (S3: tbl_df/tbl/data.frame)
 $ Country                             : Factor w/ 5 levels "Canada","Germany",..: 4 5 5 2 3 5 5 2 5 5 ...
 $ Salary                              : num [1:4195] 100000 130000 175000 64516 6636 ...
 $ YearsCodedJob                       : int [1:4195] 20 20 16 4 1 1 13 4 7 1 ...
 $ OpenSource                          : num [1:4195] 0 1 0 0 0 0 0 1 1 0 ...
 $ Hobby                               : num [1:4195] 1 1 1 0 1 1 1 0 1 0 ...
 $ CompanySizeNumber                   : num [1:4195] 5000 1000 10000 1000 5000 20 20 5000 20 20 ...
 $ CareerSatisfaction                  : int [1:4195] 8 9 7 9 5 8 7 7 8 10 ...
 $ Data_scientist                      : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Database_administrator              : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Desktop_applications_developer      : num [1:4195] 0 0 0 0 0 0 0 0 1 0 ...
 $ Developer_with_stats_math_background: num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ DevOps                              : num [1:4195] 0 1 0 0 0 0 0 0 0 0 ...
 $ Embedded_developer                  : num [1:4195] 1 1 0 0 0 0 0 0 0 0 ...
 $ Graphic_designer                    : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Graphics_programming                : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Machine_learning_specialist         : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Mobile_developer                    : num [1:4195] 0 0 0 0 0 0 0 0 1 0 ...
 $ Quality_assurance_engineer          : num [1:4195] 0 1 0 0 0 0 0 0 0 0 ...
 $ Systems_administrator               : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Web_developer                       : num [1:4195] 0 1 1 1 1 1 1 1 0 1 ...

It can be seen that all the variables except Country are either numeric or integer. XGBoost is trained here on a numeric matrix and does not handle factors directly, so the Country factor needs to be transformed into dummy variables.

Step 6: Transform factor variable into a dummy variable

#transform factor into a dummy variable
train_X <- dummy_cols(train_X,remove_first_dummy = TRUE)
train_X <- train_X %>% select(-Country)
#doing the same with test_X
test_X <- dummy_cols(test_X,remove_first_dummy = TRUE)
test_X <- test_X %>% select(-Country)
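
For comparison, the same kind of encoding can be sketched in base R with model.matrix(). This is an alternative to dummy_cols(), not what the recipe uses, and the toy data frame below is made up for illustration.

```r
# toy predictors with one factor column (values are made up for illustration)
toy <- data.frame(
  Country = factor(c("Canada", "Germany", "India", "Canada")),
  Salary  = c(100000, 64516, 6636, 130000)
)

# model.matrix() expands factors into 0/1 columns and, with the default
# treatment contrasts, drops the first level ("Canada") as the baseline
X <- model.matrix(~ ., data = toy)[, -1]   # [, -1] removes the intercept column
colnames(X)   # "CountryGermany" "CountryIndia" "Salary"
```

Because model.matrix() builds its columns from the factor's levels rather than from the values present, train and test subsets of the same data frame get the same columns in the same order. That matters because the prediction matrix has to match the training matrix.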

Step 7: Define parameters

#setting the parameters
#set.seed() is an R function, not an xgboost parameter, so call it before training
set.seed(1999)
params <- list(eval_metric = "auc",
               objective = "binary:logistic")

Step 8: Train the model

#running xgboost
model <- xgboost(data = as.matrix(train_X),
                 label = train_y,
                 params = params,
                 nrounds = 20,
                 verbose = 1)

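Step 9: Make predictions on the test set

#predict the probability of class 1 for each test row
predictions <- predict(model, newdata = as.matrix(test_X))
#convert probabilities to class labels using a 0.5 cutoff
predictions <- ifelse(predictions > 0.5, 1, 0)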
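
With the binary:logistic objective, predict() returns probabilities, so the cutoff decides how they become class labels. The following small sketch shows the effect of the cutoff on made-up probabilities; the values and the alternative cutoff of 0.9 are assumptions for illustration only.

```r
# made-up predicted probabilities for class 1
probs <- c(0.12, 0.50, 0.51, 0.97)

# strictly greater than the cutoff is labelled 1, so 0.50 itself maps to 0
pred_class <- as.integer(probs > 0.5)
pred_class                 # 0 0 1 1

# a stricter cutoff moves borderline cases to class 0
as.integer(probs > 0.9)    # 0 0 0 1
```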

Step 10: Check the accuracy

#check the accuracy
confusionMatrix(table(predictions,test_y))

	Confusion Matrix and Statistics

           test_y
predictions    0    1
          0    4    9
          1  140 1246
                                          
               Accuracy : 0.8935          
                 95% CI : (0.8761, 0.9092)
    No Information Rate : 0.8971          
    P-Value [Acc > NIR] : 0.6889          
                                          
                  Kappa : 0.0345          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.027778        
            Specificity : 0.992829        
         Pos Pred Value : 0.307692        
         Neg Pred Value : 0.898990        
             Prevalence : 0.102931        
         Detection Rate : 0.002859        
   Detection Prevalence : 0.009292        
      Balanced Accuracy : 0.510303        
                                          
       'Positive' Class : 0  

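The accuracy of the model is 89.35%. This is slightly below the no-information rate of 89.71%, and the kappa of 0.0345 and the sensitivity of 2.78% for class 0 show that the model almost always predicts class 1, the majority class. Accuracy alone is therefore a misleading summary for this imbalanced dataset.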
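
The headline figures in the confusionMatrix() output can be reproduced by hand from the four cell counts, which helps when judging how much the accuracy really says. The sketch below uses only base R; the variable names are chosen for illustration.

```r
# counts copied from the confusion matrix above
# rows = predictions, columns = test_y (matrix() fills column by column)
cm <- matrix(c(4, 140, 9, 1246), nrow = 2,
             dimnames = list(predictions = c("0", "1"), test_y = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # (4 + 1246) / 1399
nir         <- max(colSums(cm)) / sum(cm)       # share of the majority class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # class 0 is the 'positive' class
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
# accuracy 0.8935, nir 0.8971, sensitivity 0.0278, specificity 0.9928, balanced 0.5103
```

The no-information rate is the accuracy obtained by always predicting the majority class, so a model only adds value when its accuracy is clearly above it. A balanced accuracy close to 0.5 points the same way.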

Step 11: Check the essential drivers of the model

#look at the most important drivers using SHAP values
xgb.plot.shap(data = as.matrix(test_X),
              model = model,
              top_n = 5)
