How to optimise the size (depth) of trees in XGBoost in Python

This recipe helps you optimise the size (depth) of trees in XGBoost in Python.

Recipe Objective

Often, while working on a dataset with a machine learning model, we don't know which set of hyperparameters will give us the best result. Passing every set of hyperparameters through the model manually and checking the results is tedious work and may not even be feasible.

To get the best set of hyperparameters we can use Grid Search. Grid Search passes all combinations of the hyperparameters one by one into the model and checks each result. Finally, it gives us the combination of hyperparameters that produces the best result.
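To make "all combinations" concrete, here is a minimal sketch (the grid values are made up for illustration) that enumerates a small grid with scikit-learn's ParameterGrid:

from sklearn.model_selection import ParameterGrid

# A hypothetical grid with two hyperparameters: 3 x 2 = 6 combinations in total.
grid = {"max_depth": [1, 3, 5], "learning_rate": [0.1, 0.3]}
for params in ParameterGrid(grid):
    print(params)

GridSearchCV performs exactly this enumeration internally, fitting and scoring the model once per combination (times the number of cross-validation folds).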

So this recipe is a short example of how we can optimise the size (depth) of trees in XGBoost.

Step 1 - Import the library - GridSearchCV

from sklearn import datasets
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
import matplotlib
matplotlib.use("Agg")
from matplotlib import pyplot

Here we have imported various modules like StratifiedKFold, datasets, XGBClassifier and GridSearchCV from different libraries. We will understand the use of each of these later, when they appear in the code snippets.
For now, just have a look at these imports.

Step 2 - Set Up the Data

Here we have used datasets to load the inbuilt wine dataset, and we have created the objects X and y to store the data and the target values respectively. We have used train_test_split to split the dataset.

dataset = datasets.load_wine()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
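As a quick, optional sanity check (not part of the original recipe), you can inspect the split before tuning; the wine dataset has 178 samples, 13 features and 3 classes:

# Optional: confirm the shapes of the train/test split and the class labels.
print(X_train.shape, X_test.shape)  # roughly (133, 13) and (45, 13)
print(set(y))                       # {0, 1, 2}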

Step 3 - Using GridSearchCV

We have used XGBClassifier as the model and created a dictionary param_grid with the parameters we want to optimise. Finally, we have used GridSearchCV to train and fit the model. Before using GridSearchCV, let's have a look at its important parameters.

  • estimator: The model or function on which we want to run GridSearchCV.
  • param_grid: A dictionary or list of parameters of the model or function from which GridSearchCV has to select the best.
  • scoring: The evaluation metric used to judge model performance and decide the best hyperparameters; if not specified, the estimator's default score is used.

model = XGBClassifier()
max_depth = range(1, 11, 2)
print(max_depth)
param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, y)
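Two details worth noting here: scoring="neg_log_loss" is the negated log loss, because scikit-learn always maximises the score, so the best value is the one closest to zero; and StratifiedKFold keeps the class proportions roughly the same in every fold, which matters on a small, multi-class dataset like wine.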

Step 4 - Printing Results

Now we are using print statements to print the results: the mean and standard deviation of the score for every value of max_depth, along with the best one.

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
params = grid_result.cv_results_["params"]
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

As an output we get:

range(1, 11, 2)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   29.2s

Best: -0.069259 using {'max_depth': 1}

-0.069259 (0.034427) with: {'max_depth': 1}
-0.083225 (0.059937) with: {'max_depth': 3}
-0.086606 (0.061344) with: {'max_depth': 5}
-0.086606 (0.061344) with: {'max_depth': 7}
-0.086606 (0.061344) with: {'max_depth': 9}
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   29.4s finished
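Step 1 imports pyplot with the Agg backend, but the recipe never uses it. As an optional follow-up, here is a minimal sketch (the filename and labels are our own choices) that plots the mean score against tree depth from the grid-search results above:

# Plot mean negative log loss (with standard deviation) for each max_depth value.
pyplot.errorbar(list(max_depth), means, yerr=stds)
pyplot.title("XGBoost max_depth vs negative log loss")
pyplot.xlabel("max_depth")
pyplot.ylabel("neg_log_loss")
pyplot.savefig("max_depth.png")  # Agg backend renders to a file, not a window

With these results, the curve is flat or slightly worse beyond max_depth=1, suggesting shallow trees are sufficient for this small dataset.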
