How to do cost complexity pruning in a decision tree classifier
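Pruning is a technique for reducing overfitting. Just as a gardener cuts back selected branches to improve a plant's structure and promote healthy growth, pruning a decision tree removes the branches (subtrees) that add little predictive value, leaving a smaller tree that generalizes better. In scikit-learn, cost complexity pruning is controlled by the ccp_alpha parameter: the larger the value, the more of the tree is pruned away.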
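For reference, the trade-off that cost complexity pruning makes can be written as a single formula. The notation here is not from the original text; it follows the usual description of minimal cost-complexity pruning, where T is a tree, R(T) is the total impurity of its terminal nodes (leaves), and |T̃| is the number of leaves:

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

With α = 0 the full tree is kept. As α grows, each extra leaf costs more, so smaller subtrees minimize the measure.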
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
iris = sns.load_dataset('iris')
iris.head()
X = iris.drop(columns='species')
y = iris['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=20)
Fit an unpruned decision tree classifier and check its accuracy on the training and test sets.
tree = DecisionTreeClassifier()
tree.fit(Xtrain, ytrain)
ytrain_pred = tree.predict(Xtrain)
ytest_pred = tree.predict(Xtest)
print(accuracy_score(ytrain, ytrain_pred), accuracy_score(ytest, ytest_pred))
prun = tree.cost_complexity_pruning_path(Xtrain, ytrain)
alphas = prun['ccp_alphas']
alphas
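The pruning path returns more than the alphas: each candidate alpha is paired with the total leaf impurity of the subtree it produces. The sketch below prints both side by side. It is a self-contained illustration, not the code above: it assumes scikit-learn's bundled load_iris in place of the seaborn dataset, and the random_state=0 on the tree is an added assumption so the output is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled iris data stands in for sns.load_dataset('iris').
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=20)

# cost_complexity_pruning_path fits the tree internally and returns the path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtrain, ytrain)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One row per pruning step: the effective alpha and the resulting leaf impurity.
for alpha, impurity in zip(ccp_alphas, impurities):
    print(f"alpha={alpha:.5f}  total leaf impurity={impurity:.5f}")
```

The first entry corresponds to the unpruned tree, and the last one prunes the tree down to a single node, so the largest alpha is usually not a useful choice.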
Fit one pruned tree per candidate alpha on the training set, record the training and test accuracy of each, and plot both against alpha.
train_accuracy, test_accuracy = [], []
for j in alphas:
    # Refit the tree with the current pruning strength
    tree = DecisionTreeClassifier(ccp_alpha=j)
    tree.fit(Xtrain, ytrain)
    ytrain_pred = tree.predict(Xtrain)
    ytest_pred = tree.predict(Xtest)
    train_accuracy.append(accuracy_score(ytrain, ytrain_pred))
    test_accuracy.append(accuracy_score(ytest, ytest_pred))
sns.set()
plt.figure(figsize=(10, 6))
sns.lineplot(y=train_accuracy, x=alphas, label='Training Accuracy')
sns.lineplot(y=test_accuracy, x=alphas, label='Testing Accuracy')
plt.show()
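Once the accuracy curves are available, an alpha still has to be chosen. One possible approach, which the original does not cover, is to pick the alpha with the best cross-validated accuracy on the training set and keep the test set for a final check. The sketch below assumes scikit-learn's bundled load_iris, 5-fold cross-validation, and random_state=0; all three are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled iris data stands in for sns.load_dataset('iris').
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=20)

# Candidate alphas from the pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtrain, ytrain)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative rounding errors

# Mean 5-fold cross-validated accuracy for each candidate alpha (training data only).
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), Xtrain, ytrain, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

# Refit with the chosen alpha and compare tree size against the unpruned tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(Xtrain, ytrain)
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(Xtrain, ytrain)
test_score = final_tree.score(Xtest, ytest)

print("best alpha:", best_alpha)
print("nodes (unpruned, pruned):", full_tree.tree_.node_count, final_tree.tree_.node_count)
print("test accuracy:", test_score)
```

Selecting alpha on the test accuracy curve directly would leak test information into the model choice, which is why this sketch uses cross-validation on the training split instead.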