How to evaluate an XGBoost model with learning curves (example 1)?

This recipe helps you evaluate an XGBoost model with learning curves.


Recipe Objective

While training a model, we sometimes need to know how its performance changes as more rows of data are passed to it. Training on a very large dataset can take a lot of time, so it is useful to know what score the model reaches after seeing a specific percentage of the dataset. A learning curve shows this. Here we evaluate XGBoost with learning curves.

This recipe is a short example of how to evaluate an XGBoost model with learning curves.

Step 1 - Import the library

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.model_selection import learning_curve
    from xgboost import XGBClassifier

    plt.style.use('ggplot')

Here we have imported several modules, such as datasets, XGBClassifier and learning_curve, from different libraries. We will see how each one is used in the code snippets that follow.

Step 2 - Setup the Data

Here we use datasets to load the built-in wine dataset, and we create the objects X and y to store the data and the target values respectively.

    dataset = datasets.load_wine()
    X = dataset.data
    y = dataset.target
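As a quick sanity check on the data, the sketch below loads the same wine dataset through the `data` and `target` attributes of the object returned by `load_wine`, then prints its shape and class counts. The `Counter` summary and the `is_sorted_by_class` flag are illustrative additions and are not part of the original recipe.

```python
from collections import Counter

from sklearn import datasets

# Load the built-in wine dataset (features in .data, labels in .target).
dataset = datasets.load_wine()
X = dataset.data
y = dataset.target

# Number of rows and feature columns.
print(X.shape)

# How many samples belong to each class label.
class_counts = Counter(y.tolist())
print(class_counts)

# Illustrative check: are the rows stored in order of class label?
is_sorted_by_class = all(a <= b for a, b in zip(y[:-1], y[1:]))
print(is_sorted_by_class)
```

The rows turn out to be ordered by class. Since learning_curve does not shuffle the training data by default, very small training subsets may contain only one class; passing shuffle=True together with a random_state is one option worth considering.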

Step 3 - Learning Curve and Scores

Here we use learning_curve to obtain train_sizes, train_scores and test_scores. Before using learning_curve, let us look at its parameters.

  • estimator: The model for which we want to compute the learning curve.
  • train_sizes: Relative or absolute numbers of training examples that will be used to generate the learning curve. Floats are treated as fractions of the maximum training set size.
  • scoring: The metric used to evaluate model performance. If it is not specified, the estimator's own score method is used.
  • cv: Here we pass an integer value, which is the number of folds used for cross-validation. By default it is set to five.
  • n_jobs: The number of jobs to run in parallel; -1 means using all processors.
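To make the train_sizes parameter more concrete, here is a minimal sketch of how relative sizes could map to absolute sample counts. The helper `absolute_train_sizes` is hypothetical and only approximates what scikit-learn does internally (truncate to an integer, keep at least one sample, drop duplicates). The figure of 160 training rows assumes roughly 178 wine samples split with 10-fold cross-validation.

```python
def absolute_train_sizes(fractions, n_max_training_samples):
    """Convert relative training sizes into sorted, unique sample counts.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    sizes = set()
    for fraction in fractions:
        # Truncate to a whole number of samples.
        n = int(fraction * n_max_training_samples)
        # Keep the count between 1 and the maximum available.
        n = min(max(n, 1), n_max_training_samples)
        sizes.add(n)
    return sorted(sizes)


# Assumed: about 160 training rows per fold.
sizes = absolute_train_sizes([0.01, 0.1, 0.25, 0.5, 1.0], 160)
print(sizes)  # [1, 16, 40, 80, 160]

# Fractions that truncate to the same count collapse into one entry.
collapsed = absolute_train_sizes([0.01, 0.012], 160)
print(collapsed)  # [1]
```

Under these assumptions, the smallest fraction of 0.01 corresponds to a single training row, which is too little for a meaningful score, so the left edge of a learning curve should be read with caution.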
The call below computes the learning curve with 10-fold cross-validation at 50 training sizes:

    train_sizes, train_scores, test_scores = learning_curve(
        XGBClassifier(), X, y, cv=10, scoring='accuracy', n_jobs=-1,
        train_sizes=np.linspace(0.01, 1.0, 50))

Now we calculate the mean and standard deviation of the train and test scores across the cross-validation folds.

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
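The role of axis=1 may be easier to see on a small array. In the score arrays returned by learning_curve, each row corresponds to one training size and each column to one cross-validation fold, so reducing along axis=1 gives one value per training size. The numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical scores: 3 training sizes (rows) x 4 CV folds (columns).
scores = np.array([
    [0.50, 0.60, 0.70, 0.60],
    [0.80, 0.90, 0.80, 0.90],
    [0.95, 0.95, 0.95, 0.95],
])

# Average across folds: one mean per training size.
fold_mean = np.mean(scores, axis=1)

# Spread across folds (np.std uses the population formula by default).
fold_std = np.std(scores, axis=1)

# Edges of the shaded band drawn around the mean curve.
lower_band = fold_mean - fold_std
upper_band = fold_mean + fold_std

print(fold_mean)
print(fold_std)
```

A wide band at a given training size means the folds disagree with each other there, which is common when the training subset is very small.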

Step 4 - Plotting the Learning Curve

Finally, it's time to plot the learning curve. We use matplotlib to plot the lines and the bands of the learning curve.

    plt.subplots(1, figsize=(7,7))
    plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score")
    plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")
    plt.title("Learning Curve")
    plt.xlabel("Training Set Size")
    plt.ylabel("Accuracy Score")
    plt.legend(loc="best")
    plt.tight_layout()
    plt.show()

Running this code displays the learning curve plot.
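Besides reading the plot by eye, the same arrays can be used to estimate how much data is needed before the cross-validation score levels off. The sketch below is one possible way to do that; the helper name `smallest_sufficient_size`, the 0.01 tolerance and the sample values are all assumptions made for illustration.

```python
def smallest_sufficient_size(train_sizes, test_mean, tolerance=0.01):
    """Return the smallest training size whose mean CV score is within
    `tolerance` of the score reached at the largest training size.

    Hypothetical helper for illustration only.
    """
    final_score = test_mean[-1]
    for size, score in zip(train_sizes, test_mean):
        if score >= final_score - tolerance:
            return size
    return train_sizes[-1]


# Made-up values standing in for train_sizes and test_mean.
sizes = [16, 40, 80, 120, 160]
cv_scores = [0.60, 0.85, 0.93, 0.955, 0.96]

plateau_size = smallest_sufficient_size(sizes, cv_scores)
print(plateau_size)  # 120
```

With the real outputs of learning_curve, train_sizes and test_mean can be passed in directly, since the helper only relies on indexing and iteration.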
