While working on a dataset, we train a model and check its accuracy. If we check the accuracy on the same data we used for training, it comes out very high because the model has already seen that data. So for a realistic test, we have to check the accuracy on unseen data, and do so for different values of a model parameter to get a better view.
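The gap between accuracy on seen and unseen data can be checked directly. The following is a minimal sketch, assuming the iris dataset and a random forest as used later in this recipe; the 30% hold-out size, the 50 trees and the random_state values are illustrative choices, not part of the recipe.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out 30% of the samples as unseen data (illustrative split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit only on the training portion.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on data the model has not seen
print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

The training accuracy is usually at or near 1.0 here, which is why it says little about how the model behaves on new samples.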
This data science Python source code does the following:
1. Imports the iris dataset and the necessary libraries
2. Imports the validation_curve function
3. Splits the dataset into training and validation folds using cross-validation
4. Plots graphs using matplotlib to analyze the validation of the model
So this is the recipe on how to use a validation curve, and we will plot it.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
We have imported all the modules that are needed: matplotlib.pyplot, numpy, datasets, RandomForestClassifier and validation_curve. We will see the use of each module step by step.
We have loaded the inbuilt iris dataset from the datasets module and stored the data in X and the target in y.
iris = datasets.load_iris()
X, y = iris.data, iris.target
Here we are using RandomForestClassifier, so first we have to define the range of parameter values over which the validation curve will be computed. We have created an array param_range for that.
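To see what this range contains, here is a small sketch. The name coarse_range and its step of 25 are hypothetical and only illustrate a cheaper alternative; they are not part of the recipe.

```python
import numpy as np

# The range used in this recipe: odd numbers from 1 up to 249.
param_range = np.arange(1, 250, 2)
print(param_range[:5])    # first few candidate values
print(len(param_range))   # number of candidate values

# Hypothetical coarser grid: fewer values means fewer models to fit.
coarse_range = np.arange(1, 250, 25)
print(coarse_range)
```

Every value in the range is fitted once per cross-validation fold, so a finer grid gives a smoother curve at the cost of a longer run.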
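Now before using validation_curve, let us first see the parameters we pass to it: the estimator, the data X and y, param_name (the name of the parameter to vary), param_range (the values to try), cv (the number of cross-validation folds), scoring (the metric to compute) and n_jobs (-1 uses all available processors).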
param_range = np.arange(1, 250, 2)
train_scores, test_scores = validation_curve(RandomForestClassifier(),
                                             X, y,
                                             param_name="n_estimators",
                                             param_range=param_range,
                                             cv=4,
                                             scoring="accuracy",
                                             n_jobs=-1)
Now we are calculating the mean and standard deviation of the training and testing scores.
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
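The reason for axis=1 is the shape of the returned arrays: validation_curve gives one row per parameter value and one column per cross-validation fold. The sketch below shows this on a deliberately small range; small_range and random_state=0 are assumptions made only to keep the example quick and repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical short range so the example runs quickly.
small_range = np.array([1, 10, 50])

train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=small_range,
    cv=4, scoring="accuracy",
)

# Rows correspond to parameter values, columns to cross-validation folds.
print(train_scores.shape)
print(test_scores.shape)

# Averaging across the folds (axis=1) leaves one value per parameter setting.
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
print(test_mean.shape)
```

With three parameter values and cv=4, each score array has 3 rows and 4 columns, and the mean and standard deviation each have 3 entries, one per point on the curve.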
First we plot the mean accuracy scores for both the training and the testing sets, then the accuracy bands (mean plus and minus one standard deviation) for both. The last few lines are other settings for the plot, such as size and legend.
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")
plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")