How to deal with imbalanced classes with upsampling in Python?

This recipe helps you deal with imbalanced classes with upsampling in Python.

Recipe Objective

While working on a classification problem, have you ever come across an imbalanced dataset in which most samples belong to one class? To transform the dataset so that every class in the target has an equal number of samples, we can upsample it. Upsampling means increasing the number of samples in the class that has fewer of them.
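Before resampling, it can help to quantify how imbalanced the target actually is. The sketch below is only an illustration and not part of the original recipe: the helper name class_counts and the toy label vector are made up for the example.

```python
import numpy as np

def class_counts(y):
    """Return a {label: count} dict for a 1-D array of integer labels."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}

# Hypothetical toy target: 3 samples of class 0 and 9 samples of class 1.
y_toy = np.array([0] * 3 + [1] * 9)

counts = class_counts(y_toy)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(counts, imbalance_ratio)
```

A ratio well above 1 suggests that resampling may be worth considering.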

This data science Python source code does the following:
1. Imports the necessary libraries and the wine data from the sklearn datasets module
2. Uses the "where" function for data handling
3. Upsamples the smaller class to balance the data

So this is the recipe on how we can deal with imbalanced classes with upsampling in Python.

Step 1 - Import the library

import numpy as np
from sklearn import datasets

We have imported numpy and the datasets module from sklearn.

Step 2 - Setting up the Data

We have loaded the built-in wine dataset from the datasets module and stored the features in X and the target in y. This dataset is not heavily imbalanced, so we make it imbalanced to better illustrate the technique: we drop the first 30 rows by keeping only the rows from index 30 onward. Then, in the remaining data, we relabel every class that is not 0 as 1.

wine = datasets.load_wine()
X = wine.data
y = wine.target
X = X[30:,:]
y = y[30:]
y = np.where((y == 0), 0, 1)
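The recipe relies on np.where in two different forms, which are easy to confuse. The following sketch contrasts them on a small made-up label vector (y_multi is a hypothetical stand-in for wine.target, not data from the recipe).

```python
import numpy as np

# Hypothetical three-class labels standing in for wine.target.
y_multi = np.array([0, 0, 1, 2, 1, 0, 2])

# Three-argument form: element-wise choice between two values.
# Returns an array with the same shape as the input.
y_binary = np.where(y_multi == 0, 0, 1)

# One-argument form: returns a tuple of index arrays, one per dimension,
# so [0] extracts the positions for a 1-D input.
idx_class0 = np.where(y_binary == 0)[0]

print(y_binary)
print(idx_class0)
```

The three-argument form is what collapses the classes into 0 and 1, while the one-argument form yields the row positions that are later passed to np.random.choice.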

Step 3 - Upsampling the dataset

First we select the indices of the rows whose target value is 0 and of those whose target value is 1 into two separate arrays, and then print the number of observations in each.

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]
s_class0 = len(i_class0); print(); print("s_class0: ", s_class0)
s_class1 = len(i_class1); print(); print("s_class1: ", s_class1)

In the output we will see that far more samples have target value 1 than 0. So in upsampling we increase the number of samples in the class that has fewer of them. np.random.choice with replace=True does this by drawing indices of existing class 0 rows at random, with repetition, until there are as many as in class 1. It duplicates existing samples rather than creating new ones.

Then we print the combined target vector containing both classes.

i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)
print(np.hstack((y[i_class0_upsampled], y[i_class1])))

So the output comes as:


Viewing at the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

s_class0:  29

s_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
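The output above shows only the balanced target vector. To train a model, the feature rows need to be resampled with the same indices. One possible way to do that is sketched below; the function name upsample_minority, its parameters, and the toy arrays are assumptions made for illustration, and the function assumes a binary target in which minority_label is the smaller class.

```python
import numpy as np

def upsample_minority(X, y, minority_label=0, seed=0):
    """Resample minority rows with replacement until both classes are equal in size.

    Assumes a binary target in which `minority_label` is the smaller class.
    """
    rng = np.random.default_rng(seed)
    idx_min = np.where(y == minority_label)[0]
    idx_maj = np.where(y != minority_label)[0]
    # Draw minority indices with replacement, one per majority row.
    idx_min_up = rng.choice(idx_min, size=len(idx_maj), replace=True)
    idx_all = np.concatenate([idx_min_up, idx_maj])
    rng.shuffle(idx_all)  # mix the classes instead of stacking them in blocks
    # Index X and y with the same array so rows and labels stay aligned.
    return X[idx_all], y[idx_all]

# Hypothetical toy data: 2 rows of class 0 and 6 rows of class 1.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 1, 1, 1, 1, 1, 1])

X_up, y_up = upsample_minority(X_toy, y_toy)
print(X_up.shape, np.bincount(y_up))
```

Because upsampling repeats existing rows, it is generally safer to apply it to the training split only, after separating out the test data, so that copies of the same row do not end up on both sides of the split.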
