Explain stratified K fold cross validation in ML in python

This recipe explains stratified K fold cross validation in ML in python
Last Updated: 21 Dec 2022

Get access to Data Science projects View all Data Science projects

MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET ALL TAGS

Recipe Objective

Stratified K fold cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class. It provides train/test indices to split data in train/test sets.

So this recipe is a short example on what is stratified K fold cross validation . Let's get started.

Recipe Objective

Step 1 - Import the library

from sklearn import datasets from sklearn.datasets import load_breast_cancer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold from statistics import mean

Let's pause and look at these imports. Here sklearn.dataset is used to import one classification based model dataset. Also, we have exported LogisticRegression to build the model. Now StratifiedKFold will help us in performing Stratified K fold cross-validation.

Step 2 - Setup the Data

X,y=load_breast_cancer(return_X_y=True)

Here, we have used load_breast_cancer function to import our dataset in two list form (X and y) and therefore kept return_X_y to be True.

Now our dataset is ready.

Step 3 - Building the model and Cross Validation model

model = LogisticRegression() skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1) lst_accu_stratified = []

We have simply built a regressor model with LogisticRegression with default values. Now for StratifiedKFold, we have kept n_splits to be 10, dividing our dataset for 10 times. Also, the shuffling is kept to be True.

Step 4 - Building Stratified K fold cross validation

for train_index, test_index in skf.split(X, y): X_train_fold, X_test_fold = X[train_index], X[test_index] y_train_fold, y_test_fold = y[train_index], y[test_index] model.fit(X_train_fold, y_train_fold) lst_accu_stratified.append(model.score(X_test_fold, y_test_fold))

skf.split has divided our model into 10 random index set. We have then fit our model at each set and thereby calculated accuracy score.

Step 5 - Printing the results

print('Maximum Accuracy',max(lst_accu_stratified)) print('Minimum Accuracy:',min(lst_accu_stratified)) print('Overall Accuracy:',mean(lst_accu_stratified))

Here we have maximum accuracy, minimum accuracy and average accuracy across 10 fold validation set.

Step 6 - Lets look at our dataset now

Once we run the above code snippet, we will see:

Maximum Accuracy 1.0
Minimum Accuracy: 0.9137931034482759
Overall Accuracy: 0.9579185031544378

Clearly, the model performace in quite high in any case across 10 fold stratified cross validation.

What Users are saying..

Ameeruddin Mohammed

ETL (Abintio) developer at IBM

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good... Read More