How to implement XGBoost in R

XGBoost is an ensemble machine learning algorithm based on gradient boosting. This recipe shows how to use an XGBoost classifier in R.
Last Updated: 05 Sep 2022


Recipe Objective: How to implement XGBoost in R?

XGBoost is an ensemble machine learning algorithm based on gradient boosting: it builds decision trees one after another, each fitted to correct the errors of the trees before it. It is designed for both predictive performance and execution speed, and it can be used for regression and classification problems. The steps below implement an XGBoost classifier in R.
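Before the steps, a brief illustration of the gradient boosting idea may help. The sketch below is a toy version in base R, not the xgboost implementation: it boosts one-split regression stumps on synthetic data with squared-error loss. The helper names (fit_stump, predict_stump), the learning rate eta and the data are all assumptions made for this example.

```r
# Fit a one-split regression stump to targets r on a single numeric feature x.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {          # candidate split points
    left <- x < s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$split, stump$left, stump$right)
}

# Synthetic regression data (hypothetical, for illustration only).
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

eta  <- 0.3                        # learning rate (shrinkage)
pred <- rep(mean(y), length(y))    # start from a constant prediction
mse  <- numeric(20)

for (m in 1:20) {
  residual <- y - pred             # negative gradient of squared-error loss
  stump <- fit_stump(x, residual)  # weak learner fitted to the residuals
  pred <- pred + eta * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}

round(mse, 4)                      # training error after each boosting round
```

Each round fits a weak learner to what the current ensemble still gets wrong, so the training error cannot increase from one round to the next. XGBoost follows the same principle but adds regularization, second-order gradient information and a much faster tree-building routine.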


Step 1: Load the required packages

#importing required packages
library(modeldata)
library(dplyr)
library(fastDummies)
library(xgboost)
library(caret)

Step 2: Load the dataset

#loading the dataset
data("stackoverflow")
data <- stackoverflow

The stackoverflow dataset comes from the annual Stack Overflow Developer Survey. It contains 5,594 observations on developers and can be used to predict who works remotely.

Step 3: Train-Test split

#train-test split (75% train, 25% test)
library(caTools)
set.seed(101)
#in_train is TRUE for rows assigned to the training set
in_train <- sample.split(data$Remote, SplitRatio = .75)
train <- subset(data, in_train == TRUE)
test <- subset(data, in_train == FALSE)
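sample.split from caTools preserves the relative ratio of the labels in both parts, which matters here because remote workers are a small minority. The base R sketch below illustrates the same idea on a hypothetical label vector. It is not the caTools implementation, and the class counts are made up for the example.

```r
set.seed(101)

# Hypothetical imbalanced labels: 100 "Remote", 900 "Not remote".
y <- factor(rep(c("Remote", "Not remote"), times = c(100, 900)))

# Draw 75% of the row indices within each class separately,
# so that both parts keep the original class ratio.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
in_train <- seq_along(y) %in% train_idx

prop.table(table(y[in_train]))    # class shares in the training part
prop.table(table(y[!in_train]))   # class shares in the test part
```

Both tables should show the same 10% / 90% split as the full vector, which a purely random split only achieves approximately.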

Step 4: Isolate X and y

#isolating y
#as.numeric() on a factor returns the level codes 1, 2; subtracting 1 gives the 0/1 labels xgboost expects
train_y <- as.numeric(train$Remote) - 1
test_y <- as.numeric(test$Remote) - 1

#isolating X
train_X <- train %>% select(-Remote)
test_X <- test %>% select(-Remote)
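Which class ends up as 0 depends on the order of the factor levels, so it is worth checking before interpreting the results. The sketch below uses a small hypothetical factor. The level order shown is an assumption for illustration, and levels(train$Remote) gives the actual order in the dataset.

```r
# Hypothetical factor; the level order here is assumed for illustration.
remote <- factor(c("Remote", "Not remote", "Not remote", "Remote"),
                 levels = c("Remote", "Not remote"))

levels(remote)                  # the first level maps to 0, the second to 1

y <- as.numeric(remote) - 1     # level codes 1, 2 become labels 0, 1
data.frame(remote, y)
```

With this level order, "Remote" is encoded as 0 and "Not remote" as 1.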

Step 5: Check the structure of X

#checking structure of X
str(train_X)

	tibble [4,195 x 20] (S3: tbl_df/tbl/data.frame)
 $ Country                             : Factor w/ 5 levels "Canada","Germany",..: 4 5 5 2 3 5 5 2 5 5 ...
 $ Salary                              : num [1:4195] 100000 130000 175000 64516 6636 ...
 $ YearsCodedJob                       : int [1:4195] 20 20 16 4 1 1 13 4 7 1 ...
 $ OpenSource                          : num [1:4195] 0 1 0 0 0 0 0 1 1 0 ...
 $ Hobby                               : num [1:4195] 1 1 1 0 1 1 1 0 1 0 ...
 $ CompanySizeNumber                   : num [1:4195] 5000 1000 10000 1000 5000 20 20 5000 20 20 ...
 $ CareerSatisfaction                  : int [1:4195] 8 9 7 9 5 8 7 7 8 10 ...
 $ Data_scientist                      : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Database_administrator              : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Desktop_applications_developer      : num [1:4195] 0 0 0 0 0 0 0 0 1 0 ...
 $ Developer_with_stats_math_background: num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ DevOps                              : num [1:4195] 0 1 0 0 0 0 0 0 0 0 ...
 $ Embedded_developer                  : num [1:4195] 1 1 0 0 0 0 0 0 0 0 ...
 $ Graphic_designer                    : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Graphics_programming                : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Machine_learning_specialist         : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Mobile_developer                    : num [1:4195] 0 0 0 0 0 0 0 0 1 0 ...
 $ Quality_assurance_engineer          : num [1:4195] 0 1 0 0 0 0 0 0 0 0 ...
 $ Systems_administrator               : num [1:4195] 0 0 0 0 0 0 0 0 0 0 ...
 $ Web_developer                       : num [1:4195] 0 1 1 1 1 1 1 1 0 1 ...

The output shows that every variable except Country is of numeric or integer type. XGBoost expects a numeric matrix and does not accept factors directly, so each factor column must be converted into dummy variables.
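To make the dummy-variable idea concrete, here is a minimal base R sketch on a hypothetical four-row data frame. It uses model.matrix rather than the fastDummies function used in this recipe, and drops the intercept column so that a factor with k levels becomes k - 1 indicator columns.

```r
# Hypothetical toy data with one factor and one numeric column.
toy <- data.frame(
  Country = factor(c("Canada", "Germany", "India", "Canada")),
  Salary  = c(100000, 64516, 6636, 130000)
)

# model.matrix expands the factor into 0/1 indicator columns;
# [, -1] drops the intercept, and the first level ("Canada") is the baseline.
X <- model.matrix(~ ., data = toy)[, -1]

colnames(X)
X
```

A row with zeros in every Country column belongs to the baseline level, which is why one dummy can be dropped without losing information.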

Step 6: Transform factor variable into a dummy variable

#transform factor into a dummy variable
train_X <- dummy_cols(train_X, remove_first_dummy = TRUE)
train_X <- train_X %>% select(-Country)

#doing the same with test_X
test_X <- dummy_cols(test_X, remove_first_dummy = TRUE)
test_X <- test_X %>% select(-Country)

Step 7: Define parameters

#setting the seed and the parameters
#set.seed() is an R function, not an xgboost parameter, so it is called on its own
set.seed(1999)
params <- list(eval_metric = "auc", objective = "binary:logistic")
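The "auc" evaluation metric is the area under the ROC curve: the probability that a randomly chosen observation of class 1 receives a higher score than a randomly chosen observation of class 0. A minimal base R sketch of that definition follows, using the rank-sum formula on hypothetical scores. The function name auc_rank is made up for this example and this is not how xgboost computes the metric internally.

```r
# AUC from ranks: equivalent to the share of (class 1, class 0) pairs
# in which the class 1 observation has the higher score.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
label <- c(1, 1, 0, 1, 0, 0)

auc_rank(score, label)             # 8 of the 9 pairs are ordered correctly
```

Because it depends only on how the observations are ranked, AUC is less sensitive to class imbalance than accuracy.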

Step 8: Train the model

#running xgboost
model <- xgboost(data = as.matrix(train_X),
                 label = train_y,
                 params = params,
                 nrounds = 20,
                 verbose = 1)

Step 9: Evaluate the model

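#evaluate model
#predict() returns the predicted probability of class 1 for each test row
predictions <- predict(model, newdata = as.matrix(test_X))
#convert probabilities to class labels with a 0.5 cutoff
predictions <- ifelse(predictions > 0.5, 1, 0)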
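The 0.5 cutoff is a convention rather than a requirement, and with imbalanced classes a different cutoff may be worth trying. The sketch below shows how the number of observations assigned to each class changes with the cutoff. The probabilities are hypothetical values chosen for illustration, not output from the model above.

```r
# Hypothetical predicted probabilities of class 1.
prob <- c(0.97, 0.93, 0.90, 0.84, 0.71, 0.62, 0.48)

cutoffs <- c(0.5, 0.7, 0.9)

# Number of observations labelled 1 at each cutoff.
n_pred_1 <- sapply(cutoffs, function(k) sum(prob > k))

data.frame(cutoff = cutoffs,
           predicted_1 = n_pred_1,
           predicted_0 = length(prob) - n_pred_1)
```

Raising the cutoff moves borderline observations from class 1 to class 0, trading specificity for sensitivity on class 0.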

Step 10: Check the accuracy

#check the accuracy
confusionMatrix(table(predictions, test_y))

	Confusion Matrix and Statistics

           test_y
predictions    0    1
          0    4    9
          1  140 1246
                                          
               Accuracy : 0.8935          
                 95% CI : (0.8761, 0.9092)
    No Information Rate : 0.8971          
    P-Value [Acc > NIR] : 0.6889          
                                          
                  Kappa : 0.0345          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.027778        
            Specificity : 0.992829        
         Pos Pred Value : 0.307692        
         Neg Pred Value : 0.898990        
             Prevalence : 0.102931        
         Detection Rate : 0.002859        
   Detection Prevalence : 0.009292        
      Balanced Accuracy : 0.510303        
                                          
       'Positive' Class : 0

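The accuracy of the model is 89.35%. This is slightly below the no-information rate of 89.71%, and the sensitivity for class 0 is only 2.78%, so the model predicts class 1 for almost every observation. With classes this imbalanced, accuracy alone overstates how useful the model is.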
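The headline statistics can be reproduced by hand from the four cells of the confusion matrix, which makes clear what each one measures. The sketch below re-enters the counts from the output above in base R. The variable names are chosen for this example.

```r
# Counts copied from the confusion matrix: rows are predictions, columns are true labels.
cm <- matrix(c(4, 140, 9, 1246), nrow = 2,
             dimnames = list(predictions = c("0", "1"), test_y = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # share of correct predictions
nir         <- max(colSums(cm)) / sum(cm)      # always predicting the majority class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true class 0 found ("positive" class is 0)
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true class 1 found
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced), 4)
```

Only 4 of the 144 true class 0 observations are detected, which is why the balanced accuracy sits close to 0.5 even though the raw accuracy is near 0.89.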

Step 11: Check the essential drivers of the model

#look at the most important drivers
#shap values
xgb.plot.shap(data = as.matrix(test_X), model = model, top_n = 5)
