How to deal with imbalanced classes with upsampling in Python?

This recipe helps you deal with imbalanced classes with upsampling in Python.

Recipe Objective

While working on a classification problem, have you ever come across an imbalanced dataset, in which most samples belong to one particular class? To transform the dataset so that every class in the target has an equal number of samples, we can upsample it. Upsampling means increasing the number of samples of the class that has fewer of them (the minority class), here by drawing existing samples of that class repeatedly, with replacement.
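As a minimal sketch of the idea, the toy example below upsamples a small label vector. The array `y`, the seed, and the variable names are illustrative assumptions and are not part of the recipe; the recipe itself works on the wine dataset in the steps that follow.

```python
import numpy as np

# Toy labels (hypothetical): 3 samples of class 0 (minority), 7 of class 1 (majority).
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

minority_idx = np.where(y == 0)[0]
majority_idx = np.where(y == 1)[0]

# Draw minority indices with replacement until both classes have the same size.
minority_upsampled = rng.choice(minority_idx, size=len(majority_idx), replace=True)

y_balanced = np.concatenate([y[minority_upsampled], y[majority_idx]])
counts = np.bincount(y_balanced)
print(counts)  # [7 7]
```

Because the draw is with replacement, the balanced vector contains repeated copies of the original minority samples rather than new observations.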

This data science Python source code does the following:
1. Imports the necessary libraries and the wine data from the sklearn datasets module
2. Uses the "where" function for data handling
3. Upsamples the minority class to balance the data

So this is the recipe on how we can deal with imbalanced classes with upsampling in Python.

Step 1 - Import the library

import numpy as np
from sklearn import datasets

We have imported numpy and the datasets module from sklearn.

Step 2 - Setting up the Data

We have loaded the built-in wine dataset from the datasets module and stored the data in X and the target in y. This dataset is fairly balanced, so we make it imbalanced to better illustrate the functions: we remove the first 30 rows by keeping only the rows after them. Then, in the remaining data, we change every class that is not 0 to 1.

wine = datasets.load_wine()
X = wine.data
y = wine.target
X = X[30:,:]
y = y[30:]
y = np.where((y == 0), 0, 1)
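As a quick check of the setup described above, the class sizes can be counted before any resampling. This is only a sketch: the use of `np.unique` with `return_counts=True` and the printed percentages are additions for illustration, not part of the original recipe. The counts should match the 29 and 119 reported in the recipe's output.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
X = wine.data[30:, :]                      # drop the first 30 rows
y = np.where(wine.target[30:] == 0, 0, 1)  # classes 1 and 2 are merged into 1

# Count how many samples each class has after the modification.
classes, counts = np.unique(y, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({n / len(y):.1%})")
```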

Step 3 - Upsampling the dataset

First we select the indices of the rows where the target value is 0 and of the rows where it is 1 into two separate objects, and then print the number of observations in each.

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]
s_class0 = len(i_class0); print(); print("s_class0: ", s_class0)
s_class1 = len(i_class1); print(); print("s_class1: ", s_class1)

In the output we will see that the number of samples with target value 1 is much greater than the number with target value 0. So in upsampling we increase the number of samples of class 0, the class with fewer samples. np.random.choice with replace=True does not create new samples; it draws indices of existing class 0 rows with replacement, so some rows are repeated until class 0 has as many samples as class 1. Then we print the combined target vector containing both classes.

i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)
print(np.hstack((y[i_class0_upsampled], y[i_class1])))

So the output, which also shows the imbalanced target vector y before upsampling, comes as:


Viewing at the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

s_class0:  29

s_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
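The recipe prints only the balanced target vector. To train a model, the feature matrix X has to be upsampled with the same indices so that rows and labels stay aligned. The following is one possible sketch under that assumption; the names `X_balanced`, `y_balanced`, `idx`, the seed, and the final shuffle are illustrative additions and not part of the original code.

```python
import numpy as np
from sklearn import datasets

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability

wine = datasets.load_wine()
X = wine.data[30:, :]
y = np.where(wine.target[30:] == 0, 0, 1)

i_class0 = np.where(y == 0)[0]  # minority class indices
i_class1 = np.where(y == 1)[0]  # majority class indices

# Resample minority indices with replacement up to the majority class size.
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Index X and y with the same index array so features and labels stay aligned.
idx = np.concatenate([i_class0_upsampled, i_class1])
X_balanced = X[idx]
y_balanced = y[idx]

# Shuffle so the two classes are not grouped in blocks.
perm = rng.permutation(len(idx))
X_balanced, y_balanced = X_balanced[perm], y_balanced[perm]

print(X_balanced.shape, np.bincount(y_balanced))  # (238, 13) [119 119]
```

If the data is later split into training and test sets, it is usually safer to upsample only the training portion, since duplicated rows that land in both sets would make the evaluation look better than it is.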
