While training a model, we sometimes need to know how its performance changes as more rows of data are passed through it. Training on a very large dataset can take a lot of time, so it is useful to know the model's score after it has seen a specific percentage of the dataset. A learning curve shows exactly this.
This data science Python source code does the following:
1. Imports the breast cancer dataset and necessary libraries
2. Imports Learning curve function for visualization
3. Splits the dataset into train and test folds using cross-validation
4. Plots graphs using matplotlib to analyze the learning curve
So this recipe is a short example of how we can plot a learning curve in Python.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import learning_curve
Here we have imported modules such as datasets, RandomForestClassifier and learning_curve from different libraries. We will see what each of them does when we use it in the code snippets.
For now, just have a look at these imports.
Here we have used datasets to load the built-in breast cancer dataset, and we have created objects X and y to store the data and the target values respectively.
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target
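Before going further, it can help to check what was loaded. The short sketch below only inspects the objects created above; the printed attributes (shape, target_names) are standard for this dataset, but the comments describing them are a convenience, not part of the original recipe.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# X holds one row per sample and one column per numeric feature.
print(X.shape)               # (569, 30)
# y holds one class label per sample (0 or 1).
print(y.shape)               # (569,)
# Label names in the order of their integer codes.
print(cancer.target_names)   # ['malignant' 'benign']
# Number of samples in each class.
print(np.bincount(y))        # [212 357]
```

The two classes are not perfectly balanced, which is worth keeping in mind when reading accuracy scores later on.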
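Here, we use learning_curve to get train_sizes, train_scores and test_scores. Before using it, let us have a look at the parameters passed in the call:
estimator: the model to evaluate, here RandomForestClassifier().
X, y: the feature matrix and the target values.
cv: the number of cross-validation folds, here 10.
scoring: the metric used to score the model, here 'accuracy'.
n_jobs: the number of jobs to run in parallel; -1 uses all available processors.
train_sizes: the training set sizes to evaluate. Floats are read as fractions of the maximum training set size, so np.linspace(0.01, 1.0, 50) gives 50 sizes from 1% to 100%.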
train_sizes, train_scores, test_scores = learning_curve(RandomForestClassifier(), X, y, cv=10, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.01, 1.0, 50))
Now we calculate the mean and standard deviation of the train and test scores.
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
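The reason for axis=1 is that the score arrays have one row per training set size and one column per cross-validation fold, so averaging along axis 1 collapses the folds and leaves one value per size. The toy array below is a minimal sketch of that idea; its values and the names toy_scores, toy_mean and toy_std are made up for illustration and are not results from the model.

```python
import numpy as np

# Hypothetical scores: 3 training set sizes (rows) x 4 CV folds (columns).
toy_scores = np.array([
    [0.80, 0.90, 0.70, 0.80],
    [0.90, 0.90, 0.90, 0.90],
    [1.00, 0.90, 1.00, 0.90],
])

# Collapse the fold axis: one mean and one standard deviation per size.
toy_mean = np.mean(toy_scores, axis=1)
toy_std = np.std(toy_scores, axis=1)

print(toy_scores.shape, toy_mean.shape)  # (3, 4) (3,)
print(toy_mean)
print(toy_std)
```

A row whose folds all agree has a standard deviation of zero, which would show up in the plot as a band with no width at that training set size.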
Finally, it's time to plot the learning curve. We have used matplotlib to plot the lines and the bands of the learning curve.
plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score")
plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.show()
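As an output we get a plot of accuracy against training set size: the dashed line is the mean training score, the solid line is the mean cross-validation score, and the grey bands show one standard deviation around each.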
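The same information can also be read as numbers rather than from a figure. The following is a minimal sketch, not the recipe's exact configuration: it assumes a smaller forest (n_estimators=20), a fixed random_state, 5 folds and only 5 training set sizes so that it runs quickly, and it prints the gap between the mean training score and the mean cross-validation score at each size.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = datasets.load_breast_cancer(return_X_y=True)

# Smaller settings than the recipe (an assumption, chosen only for speed).
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y,
    cv=5,
    scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the fold axis to get one score per training set size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

# One line per size: absolute sample count, both mean scores, and their gap.
for n, tr, te in zip(sizes, train_mean, test_mean):
    print(f"{int(n):4d} samples  train={tr:.3f}  cv={te:.3f}  gap={tr - te:.3f}")
```

A large gap that persists as the training set grows usually suggests overfitting, while two low scores that converge suggest the model is too simple for the data.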