Explain stratified K fold cross validation?

This recipe explains stratified K fold cross-validation.


Recipe Objective

The stratified K fold cross-validation object is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class. It provides train/test indices to split the data into train/test sets.
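To see what "preserving the percentage of samples for each class" means in practice, here is a minimal sketch (not part of the original recipe) on a small, deliberately imbalanced toy label array; the names y_toy and X_toy are made up purely for illustration.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 75% class 0, 25% class 1; features are dummies, only the labels matter here.
y_toy = np.array([0] * 9 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))

for name, cv in [("KFold", KFold(n_splits=3)), ("StratifiedKFold", StratifiedKFold(n_splits=3))]:
    print(name)
    for _, test_index in cv.split(X_toy, y_toy):
        # StratifiedKFold keeps roughly 3 zeros and 1 one in every test fold,
        # while plain KFold can concentrate all the ones into a single fold.
        print("  test labels:", y_toy[test_index])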

So this recipe is a short example of what stratified K fold cross-validation is. Let's get started.

Step 1 - Import the library

from sklearn import datasets
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from statistics import mean

Let's pause and look at these imports. Here sklearn.datasets is used to load a classification dataset. We have also imported LogisticRegression to build the model. Now StratifiedKFold will help us in performing stratified K fold cross-validation.

Step 2 - Setup the Data

X, y = load_breast_cancer(return_X_y=True)

Here, we have used the load_breast_cancer function to load our dataset as two arrays (X and y) by setting return_X_y=True.

Now our dataset is ready.
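As a quick sanity check (an optional addition, not part of the original recipe), we can inspect the shapes and class balance of the arrays returned in Step 2; the counts in the comments are the approximate values for the scikit-learn breast cancer dataset.

import numpy as np

print(X.shape)         # (569, 30): 569 samples, 30 features
print(y.shape)         # (569,): one binary label per sample
print(np.bincount(y))  # about [212, 357]: the classes are moderately imbalanced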

Step 3 - Building the model and Cross Validation model

model = LogisticRegression()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

We have simply built a classification model with LogisticRegression using default values. Now for StratifiedKFold, we have set n_splits to 10, dividing our dataset into 10 folds. Also, shuffle is set to True, with random_state fixed so the splits are reproducible.

Step 4 - Building Stratified K fold cross validation

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    model.fit(X_train_fold, y_train_fold)
    lst_accu_stratified.append(model.score(X_test_fold, y_test_fold))

skf.split has divided our dataset into 10 stratified train/test index sets. We then fit the model on each training fold and record the accuracy score on the corresponding test fold.
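If you want to confirm that the folds really are stratified (an optional check, not part of the original recipe), you can compare the class ratio of each test fold with the overall class ratio of y:

import numpy as np

print("overall class ratio:", np.bincount(y) / len(y))
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    # Each test fold should show nearly the same class proportions as the full dataset.
    print("fold", fold, "test class ratio:", np.bincount(y[test_index]) / len(test_index))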

Step 5 - Printing the results

print('Maximum Accuracy', max(lst_accu_stratified))
print('Minimum Accuracy:', min(lst_accu_stratified))
print('Overall Accuracy:', mean(lst_accu_stratified))

Here we have the maximum, minimum, and average accuracy across the 10 validation folds.
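For comparison (an alternative, not the recipe's own code), the same per-fold accuracies can be obtained in a single call with cross_val_score, reusing the skf object defined in Step 3:

from sklearn.model_selection import cross_val_score

# Returns one accuracy score per fold, using the same stratified splits as above.
scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
print('Maximum Accuracy', scores.max())
print('Minimum Accuracy:', scores.min())
print('Overall Accuracy:', scores.mean())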

Step 6 - Let's look at the results now

Once we run the above code snippet, we will see:

Maximum Accuracy 1.0
Minimum Accuracy: 0.9137931034482759
Overall Accuracy: 0.9579185031544378

Clearly, the model performance is quite high in every fold of the 10 fold stratified cross-validation.
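If you also want a sense of the spread across folds (an optional addition, not in the original recipe), the statistics module imported in Step 1 provides stdev alongside mean:

from statistics import mean, stdev

# Report the average accuracy together with its fold-to-fold variability.
print('Mean Accuracy:', mean(lst_accu_stratified))
print('Std. Deviation:', stdev(lst_accu_stratified))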
