How to Upsample Data in Python?

This tutorial explains how to upsample data in Python, with clear step-by-step instructions and practical examples.
Last Updated: 03 Apr 2024


Have you ever worked on a classification problem where most samples in the dataset belong to a single class? To give each class an equal number of samples in the target variable, we can upsample the dataset. Upsampling means increasing the number of samples in the class that has fewer of them. This tutorial explains how to upsample data in Python.

What is Upsampling in Python?
Why Upsampling?
Example of Upsampling in Python
How to Upsample Data in Python?
Learn Upsampling in Python with ProjectPro!

What is Upsampling in Python?

Let's consider a simple example of a data science healthcare use case: a data scientist is working on a medical dataset for diagnosing a rare disease. The positive cases (patients with the disease) are significantly outnumbered by the negative cases (patients without the disease).

Class 1: Patients with a rare disease (minority class), represented by 200 instances.
Class 0: Patients without the disease (majority class), represented by 2000 instances.

This dataset is imbalanced because it has far fewer positive cases than negative ones. The imbalance can skew the model's predictions toward the majority class and reduce its ability to identify the minority class. If it is not addressed, the model may optimize overall accuracy at the expense of correctly identifying patients with the rare disease: a model that always predicts the negative class would already be right for 2000 of the 2200 patients (about 91%) while missing every positive case.

To address challenges like these, data scientists often upsample the minority class (patients with rare diseases) by generating synthetic data points or duplicating existing instances. This helps ensure that the model learns from a more balanced distribution of classes, improving its ability to detect rare diseases accurately.
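The duplication approach can be sketched with scikit-learn's resample utility. This is a minimal illustration that assumes randomly generated stand-in features with the 2000/200 split described above; the names X_neg, X_pos, and the five-feature shape are hypothetical, not part of any real medical dataset.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 2000 negative and 200 positive patients,
# five made-up features each.
X_neg = rng.normal(loc=0.0, size=(2000, 5))
X_pos = rng.normal(loc=1.0, size=(200, 5))

# Duplicate minority rows (sampling with replacement) until both classes match.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_balanced = np.vstack((X_neg, X_pos_up))
y_balanced = np.concatenate((np.zeros(len(X_neg), dtype=int),
                             np.ones(len(X_pos_up), dtype=int)))

print(np.bincount(y_balanced))  # [2000 2000]
```

Because rows are drawn with replacement, every upsampled positive row is a copy of one of the original 200, so no new information is created.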

More generally, upsampling refers to increasing the resolution or frequency of data points in a dataset, a technique commonly used in signal processing, computer vision, and machine learning. In machine learning, it is most often used to address class imbalance by artificially increasing the number of instances in the minority class, either by replicating existing data points or by generating synthetic ones. Upsampling can be implemented with various Python libraries such as scikit-learn, TensorFlow, or PyTorch.
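Generating synthetic points, rather than copying rows, can be sketched as follows. This is a simplified SMOTE-style interpolation written for illustration, not a reference implementation of any library's algorithm; the function name synthesize_minority, its parameters, and the random input data are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthesize_minority(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: assuming no duplicate rows, each point's
    # nearest neighbour in its own set is the point itself (column 0).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(200, 5))      # made-up minority-class features
X_new = synthesize_minority(X_min, n_new=1800)
print(X_new.shape)  # (1800, 5)
```

Each synthetic row lies on the line segment between two real minority rows, so it stays inside the region the minority class already occupies.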

Why Upsampling?

In signal processing, upsampling is used to align multiple data streams that have different sampling rates. Real-world data often arrives irregularly and out of order, which leads to noise and misalignment. Upsampling increases the sampling rate of the slower streams by adding new data points, so that the streams are synchronized and can be combined. This makes it possible to analyze and interpret data from diverse sources together.

Example of Upsampling in Python

Let’s consider a basic example demonstrating upsampling in Python using the scipy.interpolate module for linear interpolation:
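A minimal sketch of such an example is given here, assuming a hypothetical five-sample signal recorded once per second; the names t, signal, and t_up are illustrative. It uses scipy.interpolate.interp1d to double the sampling rate by linear interpolation.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical signal: five samples taken once per second.
t = np.arange(5.0)                                  # 0, 1, 2, 3, 4
signal = np.array([0.0, 2.0, 4.0, 2.0, 0.0])

# Build a linear interpolator, then evaluate it on a grid twice as dense.
linear = interp1d(t, signal, kind="linear")
t_up = np.linspace(t[0], t[-1], 2 * len(t) - 1)     # 0, 0.5, 1, ..., 4
signal_up = linear(t_up)

print(signal_up)  # [0. 1. 2. 3. 4. 3. 2. 1. 0.]
```

The original samples are preserved, and each new point is the midpoint between its two neighbours.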

How to Upsample Data in Python?

Check out the step-by-step instructions below to learn how to implement upsampling techniques in your Python projects.

Step 1 - Import the library

import numpy as np
from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. The dataset is not strongly imbalanced, so we make it imbalanced for demonstration: we drop the first 30 rows (all of which belong to class 0) by keeping only the rows from index 30 onward. Then we merge classes 1 and 2 into a single class 1, leaving class 0 unchanged, so that the target becomes binary.

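# Load the wine dataset through the datasets module imported in Step 1
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Drop the first 30 rows to make class 0 the minority
X = X[30:, :]
y = y[30:]

# Keep class 0 as 0 and relabel every other class as 1
y = np.where(y == 0, 0, 1)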
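To confirm that the data is now imbalanced, the class counts can be checked with np.bincount. The snippet below is a self-contained sketch that repeats the setup from this step; the variable name counts is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
y = wine.target[30:]              # drop the first 30 rows, as in Step 2
y = np.where(y == 0, 0, 1)        # class 0 stays 0; classes 1 and 2 become 1

counts = np.bincount(y)           # number of samples per class label
print(counts)                     # [ 29 119]
print(counts[0] / counts.sum())   # share of class 0 in the target
```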

Step 3 - Upsampling the Dataset

First, we collect the row indices where the target is 0 and where it is 1 into two separate arrays, and then print the number of observations in each.

# Row indices of each class
i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Number of observations in each class
s_class0 = len(i_class0)
s_class1 = len(i_class1)
print("s_class0:", s_class0)
print("s_class1:", s_class1)

The output shows that there are far more samples with target value 1 than with target value 0. So, to upsample, we increase the number of samples in class 0, the smaller class. np.random.choice draws class-0 row indices with replacement until there are as many as in class 1, which duplicates existing rows rather than creating new ones. Finally, we print the combined target vector containing both classes.

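# Sample class-0 indices with replacement to match the size of class 1
i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)

# Join the upsampled class-0 targets with the class-1 targets
print(np.hstack((y[i_class0_upsampled], y[i_class1])))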

The output is shown below, preceded by the imbalanced target vector from Step 2 for comparison:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

s_class0: 29

s_class1: 119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
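The steps above only print the balanced target vector. To train a model, the same indices would also need to be applied to the feature matrix. The following self-contained sketch shows one way to do that; the names X_balanced, y_balanced, and idx, the fixed seed, and the final shuffle are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data[30:], np.where(wine.target[30:] == 0, 0, 1)

rng = np.random.default_rng(42)   # fixed seed so the draw is repeatable
i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Draw class-0 row indices with replacement until they match class 1 in number.
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Apply the same indices to X and y so features and labels stay aligned.
idx = np.concatenate((i_class0_upsampled, i_class1))
rng.shuffle(idx)                  # mix the classes instead of leaving them in blocks
X_balanced, y_balanced = X[idx], y[idx]

print(X_balanced.shape, np.bincount(y_balanced))  # (238, 13) [119 119]
```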

Learn Upsampling in Python with ProjectPro!

Knowing how to upsample in Python is important for handling imbalanced datasets, and it is useful in fields like data science and big data analysis. Beyond reading about it, hands-on practice through real projects is key. ProjectPro offers over 270 projects on data science, big data, and data engineering to give you practical experience with real data. Check out the ProjectPro Repository to become proficient in upsampling techniques.
