How to deal with imbalanced classes with upsampling in Python?

This recipe helps you deal with imbalanced classes with upsampling in Python.

Recipe Objective

While working on a classification problem, you may come across an imbalanced dataset, in which most samples belong to one class. To transform the dataset so that every class in the target has an equal number of samples, we can upsample it. Upsampling means increasing the number of samples in the class that has fewer of them (the minority class).
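
To make the idea concrete, here is a minimal sketch on a hypothetical toy label vector (the name y_toy and the seed are assumptions for illustration, not part of the recipe). It draws minority-class indices with replacement until both classes are the same size.

```python
import numpy as np

# Hypothetical toy labels: two samples of class 0, six of class 1.
y_toy = np.array([0, 0, 1, 1, 1, 1, 1, 1])

rng = np.random.default_rng(0)  # seeded so the draw is repeatable

minority_idx = np.flatnonzero(y_toy == 0)
majority_idx = np.flatnonzero(y_toy == 1)

# Draw minority indices with replacement until both classes are the same size.
minority_upsampled = rng.choice(minority_idx, size=majority_idx.size, replace=True)

y_balanced = np.concatenate([y_toy[minority_upsampled], y_toy[majority_idx]])
counts = np.bincount(y_balanced)
print(counts)  # [6 6]
```

Note that no new information is created: the two original class 0 samples are simply repeated.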

This data science Python source code does the following:
1. Imports the necessary libraries and the wine data from sklearn's datasets module
2. Uses NumPy's "where" function for data handling
3. Upsamples the smaller class to balance the data

So this is the recipe on how we can deal with imbalanced classes with upsampling in Python.

Step 1 - Import the library

import numpy as np
from sklearn import datasets

We have imported numpy and the datasets module from sklearn.

Step 2 - Setting up the Data

We have loaded the built-in wine dataset from the datasets module and stored the features in X and the target in y. This dataset is not imbalanced, so we make it imbalanced to better illustrate the functions: we drop the first 30 rows by keeping only the rows from index 30 onward. Then, in the selected data, we relabel every class that is not 0 as 1. Finally, we print the resulting target vector.

wine = datasets.load_wine()
X = wine.data
y = wine.target
X = X[30:, :]
y = y[30:]
y = np.where((y == 0), 0, 1)
print("Viewing at the imbalanced target vector:\n", y)
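
Before resampling, it can help to confirm how skewed the target actually is. The sketch below is one way to do that with np.unique. To keep it independent of scikit-learn, y is rebuilt here as a stand-in with the class sizes reported in this recipe's output (29 and 119); in the recipe itself you would pass the y created above.

```python
import numpy as np

# Stand-in for the recipe's y: 29 samples of class 0 and 119 of class 1
# (sizes taken from the recipe's printed output).
y = np.array([0] * 29 + [1] * 119)

# Count how many samples each class has.
classes, counts = np.unique(y, return_counts=True)
for label, count in zip(classes, counts):
    print(f"class {label}: {count} samples ({count / y.size:.1%})")
# class 0: 29 samples (19.6%)
# class 1: 119 samples (80.4%)
```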

Step 3 - Upsampling the dataset

First we select the indices of the rows whose target value is 0 and of those whose target value is 1 into two separate arrays, and then print the number of observations in each.

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]
s_class0 = len(i_class0)
print()
print("s_class0: ", s_class0)
s_class1 = len(i_class1)
print()
print("s_class1: ", s_class1)

In the output we will see that there are far more samples with target value 1 than with target value 0. So in upsampling we increase the number of samples of the class that has fewer of them. np.random.choice with replace=True does this by drawing indices of existing class 0 rows at random, with repetition, until there are as many of them as class 1 rows. It duplicates existing samples rather than creating new ones.

Then we print the combined target vector containing both class 0 and class 1.

i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)
print(np.hstack((y[i_class0_upsampled], y[i_class1])))

So the output comes as:


Viewing at the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

s_class0:  29

s_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
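
The recipe prints only the balanced target vector. To train a model, the feature rows need to be resampled with the same indices. The following is a minimal sketch of that, not the recipe's own code: the helper name upsample_minority, its parameters, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np

def upsample_minority(X, y, minority_label=0, seed=0):
    """Resample the minority rows with replacement until they match
    the number of remaining rows, and return the balanced X and y."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    upsampled_idx = rng.choice(minority_idx, size=majority_idx.size, replace=True)
    keep = np.concatenate([upsampled_idx, majority_idx])
    # Index X and y with the same array so features and labels stay aligned.
    return X[keep], y[keep]

# Stand-in data shaped like the recipe's: 148 rows, 13 features,
# 29 labels of class 0 and 119 of class 1. The feature values are random.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(148, 13))
y_demo = np.array([0] * 29 + [1] * 119)

X_bal, y_bal = upsample_minority(X_demo, y_demo)
print(X_bal.shape, np.bincount(y_bal))  # (238, 13) [119 119]
```

If you also hold out a test set, split the data first and upsample only the training portion, so that duplicated rows do not end up on both sides of the split.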
