
# How to deal with imbalanced classes with downsampling in Python?

This recipe helps you deal with imbalanced classes with downsampling in Python.


## Recipe Objective

While working on a classification problem, you may come across a biased (imbalanced) dataset in which most samples belong to one particular class. To transform the dataset so that every class in the target has an equal number of samples, we can downsample it. Downsampling means reducing the number of samples in the over-represented (majority) class.

This Python source code does the following:
1. Imports the necessary libraries and the wine data from the sklearn datasets module
2. Uses NumPy's `where` function for data handling
3. Downsamples the majority class to balance the data

So this is the recipe on how we can deal with imbalanced classes with downsampling in Python.

## Step 1 - Import the library

```
import numpy as np
from sklearn import datasets
```

We have imported numpy and the datasets module from sklearn.

## Step 2 - Setting up the Data

We have loaded the built-in wine dataset from the datasets module and stored the features in X and the target in y. This dataset is not imbalanced, so we make it imbalanced to better illustrate the technique: we drop the first 30 rows by keeping only the rows from index 30 onward. Then, in the remaining data, we relabel every class that is not 0 as 1.

```
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Drop the first 30 rows (all of them belong to class 0)
X = X[30:, :]
y = y[30:]

# Keep class 0 as 0 and merge every other class into 1
y = np.where((y == 0), 0, 1)
print("Viewing the imbalanced target vector:\n", y)
```
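Before downsampling, it can help to quantify the imbalance. The following is a minimal sketch, not part of the original recipe: it uses a stand-in target vector (an assumption, built to match the 29 and 119 counts shown in this recipe's output) so that it runs without loading the wine data, and it counts the labels with `np.bincount`.

```python
import numpy as np

# Stand-in for the binarized wine target (assumption: 29 zeros followed by
# 119 ones, matching the class counts shown in this recipe's output)
y = np.concatenate([np.zeros(29, dtype=int), np.ones(119, dtype=int)])

# counts[k] is the number of samples whose label is k
counts = np.bincount(y)

# Ratio of the largest class to the smallest class
ratio = counts.max() / counts.min()

print("class counts:", counts)
print("imbalance ratio: %.2f" % ratio)
```

With the real data, the same two lines (`np.bincount(y)` and the ratio) can be applied directly to the `y` built above.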

## Step 3 - Downsampling the dataset

First, we collect the indices of the rows whose target value is 0 and the indices of the rows whose target value is 1 into two separate arrays, and then print the number of observations in each.

```
# np.where returns a tuple of arrays, so take element [0] to get the indices
w_class0 = np.where(y == 0)[0]
w_class1 = np.where(y == 1)[0]

n_class0 = len(w_class0)
n_class1 = len(w_class1)

print("n_class0: ", n_class0)
print("n_class1: ", n_class1)
```

In the output we will see that the number of samples with target value 1 is much greater than the number with target value 0. So, to downsample, we randomly select (without replacement) as many rows with target 1 as there are rows with target 0. Then we print the combined target vector containing both classes.

```
# Randomly pick n_class0 indices from class 1, without replacement
w_class1_downsampled = np.random.choice(w_class1, size=n_class0, replace=False)

print()
print(np.hstack((y[w_class0], y[w_class1_downsampled])))
```

The output is:

```
Viewing the imbalanced target vector:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

n_class0:  29

n_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
```
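The recipe prints only the balanced target vector. In practice the feature matrix usually needs to be downsampled with the same indices. The following is a minimal sketch of one way to do that, under stated assumptions: `X` and `y` here are stand-in arrays (random features with 13 columns, and 29 versus 119 labels) rather than the wine data, the names `keep`, `X_balanced` and `y_balanced` are illustrative, and a seeded `np.random.default_rng` is used in place of `np.random.choice` so the selection is reproducible.

```python
import numpy as np

# Seeded generator (assumption: seed 0) so the random selection is repeatable
rng = np.random.default_rng(0)

# Stand-in data (assumption): 29 samples of class 0 and 119 of class 1,
# with 13 random feature columns
y = np.concatenate([np.zeros(29, dtype=int), np.ones(119, dtype=int)])
X = rng.normal(size=(y.size, 13))

# Row indices of each class
idx_class0 = np.where(y == 0)[0]
idx_class1 = np.where(y == 1)[0]

# Randomly keep as many class-1 rows as there are class-0 rows
idx_class1_down = rng.choice(idx_class1, size=idx_class0.size, replace=False)

# Apply the same row indices to both X and y so they stay aligned
keep = np.sort(np.concatenate([idx_class0, idx_class1_down]))
X_balanced = X[keep]
y_balanced = y[keep]

print(X_balanced.shape, np.bincount(y_balanced))
```

Indexing `X` and `y` with the same `keep` array is what keeps each feature row paired with its label. Sorting the indices is optional and only preserves the original row order.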
