How to deal with imbalance classes with downsampling in Python?
DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to deal with imbalance classes with downsampling in Python?

How to deal with imbalance classes with downsampling in Python?

This recipe helps you deal with imbalance classes with downsampling in Python

0

Recipe Objective

While working on classification problem have you ever come across a bias dataset which contains most samples of a particular class. So to transform the dataset such that it contains equal number of classes in target value we can downsample the dataset. Downsampling means to reduce the number of samples having the bias class.

This data science python source code does the following:
1. Imports necessary libraries and iris data from sklearn dataset
2. Use of "where" function for data handling
3. Downsamples the higher class to balance the data

So this is the recipe on how we can deal with imbalance classes with downsampling in Python.

Step 1 - Import the library

import numpy as np from sklearn import datasets

We have imported numpy and datasets modules.

Step 2 - Setting up the Data

We have imported inbuilt wine datset form the datasets module and stored the data in x and target in y. This dataset is not bias so we are making it bias for better understanding of the functions, we have removed first 30 rows by selecting the rows after the 30 rows. Then in the selected data we have changed the class which are not 0 to 1. wine = datasets.load_wine() X = wine.data y = wine.target X = X[30:,:] y = y[30:] y = np.where((y == 0), 0, 1) print("Viewing the imbalanced target vector:\n", y)

Step 3 - Downsampling the dataset

First we are selecting the rows where target values are 0 and 1 in two different objects and then printing the number of observations in the two objects. w_class0 = np.where(y == 0)[0] w_class1 = np.where(y == 1)[0] n_class0 = len(w_class0) n_class1 = len(w_class1) print("n_class0: ", n_class0) print("n_class1: ", n_class1) In the output we will see the number of samples having target values as 1 are much more greater than 0. So in downsampling we will randomly select the number of rows having target as 1 and make it equal to the number of rows having taregt values 0.
Then we have printed the joint dataset having target class as 0 and 1. w_class1_downsampled = np.random.choice(w_class1, size=n_class0, replace=False) print(); print(np.hstack((y[w_class0], y[w_class1_downsampled]))) So the output comes as:

Viewing the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

n_class0:  29

n_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Relevant Projects

Predict Macro Economic Trends using Kaggle Financial Dataset
In this machine learning project, you will uncover the predictive value in an uncertain world by using various artificial intelligence, machine learning, advanced regression and feature transformation techniques.

Solving Multiple Classification use cases Using H2O
In this project, we are going to talk about H2O and functionality in terms of building Machine Learning models.

Walmart Sales Forecasting Data Science Project
Data Science Project in R-Predict the sales for each department using historical markdown data from the Walmart dataset containing data of 45 Walmart stores.

Ecommerce product reviews - Pairwise ranking and sentiment analysis
This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

Data Science Project on Wine Quality Prediction in R
In this R data science project, we will explore wine dataset to assess red wine quality. The objective of this data science project is to explore which chemical properties will influence the quality of red wines.

Human Activity Recognition Using Smartphones Data Set
In this deep learning project, you will build a classification system where to precisely identify human fitness activities.

Predict Credit Default | Give Me Some Credit Kaggle
In this data science project, you will predict borrowers chance of defaulting on credit loans by building a credit score prediction model.

Machine Learning project for Retail Price Optimization
In this machine learning pricing project, we implement a retail price optimization algorithm using regression trees. This is one of the first steps to building a dynamic pricing model.

Forecast Inventory demand using historical sales data in R
In this machine learning project, you will develop a machine learning model to accurately forecast inventory demand based on historical sales data.

Resume parsing with Machine learning - NLP with Python OCR and Spacy
In this machine learning resume parser example we use the popular Spacy NLP python library for OCR and text classification.