Explain stratified K fold cross validation ?

Explain stratified K fold cross validation ?

Explain stratified K fold cross validation ?

This recipe explains stratified K fold cross validation


Recipe Objective

Stratified K fold cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class. It provides train/test indices to split data in train/test sets.

So this recipe is a short example on what is stratified K fold cross validation . Let's get started.

Step 1 - Import the library

from sklearn import datasets from sklearn.datasets import load_breast_cancer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold from statistics import mean

Let's pause and look at these imports. Here sklearn.dataset is used to import one classification based model dataset. Also, we have exported LogisticRegression to build the model. Now StratifiedKFold will help us in performing Stratified K fold cross-validation.

Step 2 - Setup the Data


Here, we have used load_breast_cancer function to import our dataset in two list form (X and y) and therefore kept return_X_y to be True.

Now our dataset is ready.

Step 3 - Building the model and Cross Validation model

model = LogisticRegression() skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1) lst_accu_stratified = []

We have simply built a regressor model with LogisticRegression with default values. Now for StratifiedKFold, we have kept n_splits to be 10, dividing our dataset for 10 times. Also, the shuffling is kept to be True.

Step 4 - Building Stratified K fold cross validation

for train_index, test_index in skf.split(X, y): X_train_fold, X_test_fold = X[train_index], X[test_index] y_train_fold, y_test_fold = y[train_index], y[test_index] model.fit(X_train_fold, y_train_fold) lst_accu_stratified.append(model.score(X_test_fold, y_test_fold))

skf.split has divided our model into 10 random index set. We have then fit our model at each set and thereby calculated accuracy score.

Step 5 - Printing the results

print('Maximum Accuracy',max(lst_accu_stratified)) print('Minimum Accuracy:',min(lst_accu_stratified)) print('Overall Accuracy:',mean(lst_accu_stratified))

Here we have maximum accuracy, minimum accuracy and average accuracy across 10 fold validation set.

Step 6 - Lets look at our dataset now

Once we run the above code snippet, we will see:

Maximum Accuracy 1.0
Minimum Accuracy: 0.9137931034482759
Overall Accuracy: 0.9579185031544378

Clearly, the model performace in quite high in any case across 10 fold stratified cross validation.

Relevant Projects

Sequence Classification with LSTM RNN in Python with Keras
In this project, we are going to work on Sequence to Sequence Prediction using IMDB Movie Review Dataset​ using Keras in Python.

German Credit Dataset Analysis to Classify Loan Applications
In this data science project, you will work with German credit dataset using classification techniques like Decision Tree, Neural Networks etc to classify loan applications using R.

Data Science Project on Wine Quality Prediction in R
In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

Walmart Sales Forecasting Data Science Project
Data Science Project in R-Predict the sales for each department using historical markdown data from the Walmart dataset containing data of 45 Walmart stores.

Predict Census Income using Deep Learning Models
In this project, we are going to work on Deep Learning using H2O to predict Census income.

Perform Time series modelling using Facebook Prophet
In this project, we are going to talk about Time Series Forecasting to predict the electricity requirement for a particular house using Prophet.

Loan Eligibility Prediction using Gradient Boosting Classifier
This data science in python project predicts if a loan should be given to an applicant or not. We predict if the customer is eligible for loan based on several factors like credit score and past history.

Customer Market Basket Analysis using Apriori and Fpgrowth algorithms
In this data science project, you will learn how to perform market basket analysis with the application of Apriori and FP growth algorithms based on the concept of association rule learning.

Human Activity Recognition Using Smartphones Data Set
In this deep learning project, you will build a classification system where to precisely identify human fitness activities.

Data Science Project in Python on BigMart Sales Prediction
The goal of this data science project is to build a predictive model and find out the sales of each product at a given Big Mart store.