Explain stratified K fold cross validation ?
MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET     ALL TAGS

Explain stratified K fold cross validation ?

Explain stratified K fold cross validation ?

This recipe explains stratified K fold cross validation

Recipe Objective

Stratified K fold cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class. It provides train/test indices to split data in train/test sets.

So this recipe is a short example on what is stratified K fold cross validation . Let's get started.

Step 1 - Import the library

from sklearn import datasets from sklearn.datasets import load_breast_cancer from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold from statistics import mean

Let's pause and look at these imports. Here sklearn.dataset is used to import one classification based model dataset. Also, we have exported LogisticRegression to build the model. Now StratifiedKFold will help us in performing Stratified K fold cross-validation.

Step 2 - Setup the Data

X,y=load_breast_cancer(return_X_y=True)

Here, we have used load_breast_cancer function to import our dataset in two list form (X and y) and therefore kept return_X_y to be True.

Now our dataset is ready.

Step 3 - Building the model and Cross Validation model

model = LogisticRegression() skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1) lst_accu_stratified = []

We have simply built a regressor model with LogisticRegression with default values. Now for StratifiedKFold, we have kept n_splits to be 10, dividing our dataset for 10 times. Also, the shuffling is kept to be True.

Step 4 - Building Stratified K fold cross validation

for train_index, test_index in skf.split(X, y): X_train_fold, X_test_fold = X[train_index], X[test_index] y_train_fold, y_test_fold = y[train_index], y[test_index] model.fit(X_train_fold, y_train_fold) lst_accu_stratified.append(model.score(X_test_fold, y_test_fold))

skf.split has divided our model into 10 random index set. We have then fit our model at each set and thereby calculated accuracy score.

Step 5 - Printing the results

print('Maximum Accuracy',max(lst_accu_stratified)) print('Minimum Accuracy:',min(lst_accu_stratified)) print('Overall Accuracy:',mean(lst_accu_stratified))

Here we have maximum accuracy, minimum accuracy and average accuracy across 10 fold validation set.

Step 6 - Lets look at our dataset now

Once we run the above code snippet, we will see:

Maximum Accuracy 1.0
Minimum Accuracy: 0.9137931034482759
Overall Accuracy: 0.9579185031544378

Clearly, the model performace in quite high in any case across 10 fold stratified cross validation.

Relevant Projects

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Build a Face Recognition System in Python using FaceNet
In this deep learning project, you will build your own face recognition system in Python using OpenCV and FaceNet by extracting features from an image of a person's face.

Build a Music Recommendation Algorithm using KKBox's Dataset
Music Recommendation Project using Machine Learning - Use the KKBox dataset to predict the chances of a user listening to a song again after their very first noticeable listening event.

Build a Similar Images Finder with Python, Keras, and Tensorflow
Build your own image similarity application using Python to search and find images of products that are similar to any given product. You will implement the K-Nearest Neighbor algorithm to find products with maximum similarity.

Data Science Project on Wine Quality Prediction in R
In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

Data Science Project in Python on BigMart Sales Prediction
The goal of this data science project is to build a predictive model and find out the sales of each product at a given Big Mart store.

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

German Credit Dataset Analysis to Classify Loan Applications
In this data science project, you will work with German credit dataset using classification techniques like Decision Tree, Neural Networks etc to classify loan applications using R.

Ecommerce product reviews - Pairwise ranking and sentiment analysis
This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

Digit Recognition using CNN for MNIST Dataset in Python
In this deep learning project, you will build a convolutional neural network using MNIST dataset for handwritten digit recognition.