Generally, for any classification problem, we predict the class value that has the highest probability of being the true class label. Sometimes, however, we care about the predicted probabilities themselves: among all instances to which the model assigns a probability of around 0.8, roughly 80% should actually turn out positive. A calibration curve lets us check exactly this. Note that calibration_curve supports binary targets (0 and 1) only.
So this recipe is a short example of what calibration means. Let's get started.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
Let's pause and look at these imports. We have imported train_test_split, which helps in randomly splitting the dataset into two parts. From sklearn.datasets we have imported load_breast_cancer, a binary classification dataset, which is exactly what calibration_curve requires. Also, we have imported calibration_curve to assess how well our model's predicted probabilities are calibrated.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Here, we have used the load_breast_cancer function to import our dataset as two arrays (X and y), and therefore kept return_X_y as True. Further, we have split the dataset into two parts, train and test, with test_size=0.25, i.e., a 3:1 train-to-test ratio.
Now our dataset is ready.
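As a quick sanity check, we can print the shapes of the splits to confirm the 3:1 ratio (the exact counts below assume the standard breast cancer dataset with 569 rows and 30 features):
# Confirm the 3:1 train-to-test split
print(X_train.shape, X_test.shape)
# Expected: about 75% of the 569 rows in train and 25% in test,
# i.e. (426, 30) and (143, 30)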
model = DecisionTreeClassifier(criterion='entropy', max_features=2)
We have simply built a classification model with DecisionTreeClassifier, setting criterion to 'entropy' (splits are chosen to maximize information gain) and max_features to 2 (at most 2 features are considered when looking for each split).
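To make the 'entropy' criterion concrete, here is a minimal sketch of the impurity measure the tree reduces at each split (this helper uses numpy, which is an extra assumption on top of the recipe's imports):
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array: -sum(p * log2(p)) over classes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 1, 0, 1])))  # 1.0 bit: a 50/50 node is maximally impure
print(entropy(np.array([1, 1, 1, 1])))  # -0.0: a pure node has zero entropy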
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
Here we have simply used the fit function to fit our model on X_train and y_train. Then, rather than hard class predictions, we take predict_proba and keep the probability of the positive class (column 1), since calibration_curve expects probabilities rather than 0/1 labels.
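If you also want hard class labels, for example for a quick accuracy check, you can still call predict; this aside is optional and not needed for the calibration curve itself:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))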
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
Now we are comparing our predicted probabilities to the actual outcomes. n_bins refers to the number of bins used to discretize the [0, 1] interval. For each bin, calibration_curve returns the fraction of positives (prob_true) and the mean predicted probability (prob_pred). Since predict_proba already returns values in [0, 1], no extra normalization is needed.
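To see what calibration_curve computes under the hood, here is a minimal numpy sketch assuming uniform bins (calibration_curve's default strategy):
import numpy as np

edges = np.linspace(0.0, 1.0, 10 + 1)       # 10 uniform bins over [0, 1]
bin_ids = np.digitize(y_prob, edges[1:-1])  # assign each probability to a bin
for b in np.unique(bin_ids):
    mask = bin_ids == b
    # mean predicted probability vs. observed fraction of positives per bin
    print(y_prob[mask].mean(), y_test[mask].mean())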
plt.plot([0, 1], [0, 1], linestyle='--', label='Ideally Calibrated')
plt.plot(prob_pred, prob_true, marker='.', label='Decision Tree Classifier')
plt.xlabel('Average Predicted Probability in each bin')
plt.ylabel('Ratio of positives')
plt.legend()
plt.show()
Here, first we have plotted the ideally calibrated curve, which is a straight diagonal line from (0, 0) to (1, 1). Then we plot the calibration curve of this particular model. The x-axis represents the average predicted probability in each bin; the y-axis is the fraction of positives, i.e., the proportion of actual positives among the instances in that bin.
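If you would like a single number to accompany the plot, the Brier score (the mean squared error between predicted probabilities and actual outcomes) is a common companion metric; this is an optional extra on top of the recipe, using sklearn's brier_score_loss:
from sklearn.metrics import brier_score_loss

print('Brier score:', brier_score_loss(y_test, y_prob))  # lower is better; 0 is perfect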
Once we run the above code snippet, we will see:
Scroll down to the IPython notebook below to visualize the results.
The closer the model's curve lies to the diagonal, the better calibrated its probabilities are. An unpruned decision tree tends to push probabilities toward the extremes (its leaves are often pure), so some deviation from the diagonal is expected; the curve tells us how far we can trust the predicted probabilities, not just the predicted labels, on unseen data.