How to Upsample Data in Python?

This tutorial will help you understand how to upsample data in Python, offering clear step-by-step instructions and practical examples.

Have you ever encountered an imbalanced dataset, where most samples belong to a single class, while working on a classification problem? To give every class in the target variable an equal number of samples, we can upsample the dataset. Upsampling means increasing the number of samples in the classes that have fewer of them. Check out this tutorial to understand how to upsample data in Python.

What is Upsampling in Python? 

Let's consider a simple example of a data science healthcare use case: a data scientist is working on a medical dataset for diagnosing a rare disease. The positive cases (patients with the disease) are significantly outnumbered by the negative cases (patients without the disease).

  • Class 1: Patients with a rare disease (minority class), represented by 200 instances.

  • Class 0: Patients without the disease (majority class), represented by 2000 instances.

This dataset is imbalanced because it has far fewer positive cases than negative ones. The imbalance can skew the model's predictions towards the majority class and reduce its ability to identify the minority class accurately. If it is not addressed, the model may optimize overall accuracy at the expense of correctly identifying patients with the rare disease, which creates a risk of misclassification because the model is biased towards predicting negative cases.
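To make the risk concrete, here is a minimal sketch using the class sizes from the example above. The "always negative" baseline is a hypothetical model introduced only for illustration; it shows how accuracy can look high while every patient with the disease is missed.

```python
import numpy as np

# Labels from the example above: 2000 negatives (class 0), 200 positives (class 1).
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(200, dtype=int)])

# Hypothetical "always negative" baseline: it never predicts the disease.
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of labels predicted correctly (2000 / 2200).
accuracy = np.mean(y_pred == y_true)

# Recall on the minority class: fraction of true positives that were found.
recall_positive = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.3f}")
print(f"recall (class 1): {recall_positive:.3f}")
```

The baseline is right about 91% of the time, yet its recall on the minority class is zero.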

 

To address challenges like these, data scientists often upsample the minority class (patients with the rare disease) by generating synthetic data points or duplicating existing instances. This helps ensure that the model learns from a more balanced distribution of classes, improving its ability to detect the rare disease accurately.
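Duplicating existing instances can be sketched with scikit-learn's `resample` utility. The feature arrays below are randomly generated stand-ins (five made-up features per patient), not real medical data, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 made-up features per patient.
X_majority = rng.normal(size=(2000, 5))   # class 0
X_minority = rng.normal(size=(200, 5))    # class 1

# Draw minority rows with replacement until both classes have the same size.
X_minority_upsampled = resample(
    X_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=0,
)

# Stack the two classes and build the matching label vector.
X_balanced = np.vstack((X_majority, X_minority_upsampled))
y_balanced = np.concatenate((
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_upsampled), dtype=int),
))

print(np.bincount(y_balanced))
```

Because rows are drawn with replacement, the upsampled minority class contains repeated copies of the original 200 patients rather than new information.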

 

Upsampling in Python refers to increasing the resolution or frequency of data points in a dataset. This technique is commonly used in signal processing, computer vision, and machine learning. In machine learning, upsampling is often employed to address class imbalance by artificially increasing the number of instances in the minority class. It involves replicating existing data points or generating synthetic data points to balance the distribution of classes in the dataset. Upsampling can be implemented using various Python libraries such as scikit-learn, TensorFlow, or PyTorch.
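Generating synthetic data points, as opposed to replicating existing ones, can be sketched with plain NumPy. The helper below, `synthesize_minority`, is a hypothetical function written for this illustration: it places each new point at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified interpolation scheme in the spirit of SMOTE, not the implementation of any particular library.

```python
import numpy as np

def synthesize_minority(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position along the segment between the base and the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))   # hypothetical minority-class features
X_synthetic = synthesize_minority(X_minority, n_new=30)

print(X_synthetic.shape)
```

Each synthetic row lies between two real minority samples, so the new points stay inside the region the minority class already occupies.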

Why Upsampling? 

Upsampling is crucial in signal processing to align multiple data streams with varying sampling rates. In real-world scenarios, data often arrives irregularly and out of order, leading to noise and misalignment. Upsampling techniques increase the sampling rate of the slower streams by adding new data points, which synchronizes the streams and enables accurate analysis. This is essential for combining data from diverse sources and interpreting it consistently.
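As a minimal sketch of this alignment, assume two hypothetical sensor streams: A is sampled every second and B every four seconds. The timestamps and values are made up for illustration; `np.interp` fills in B's missing points by linear interpolation so that both streams share A's timestamps.

```python
import numpy as np

# Hypothetical streams: sensor A sampled every 1 s, sensor B every 4 s.
t_a = np.arange(0.0, 9.0, 1.0)        # 0, 1, ..., 8
t_b = np.arange(0.0, 9.0, 4.0)        # 0, 4, 8
b_values = np.array([10.0, 18.0, 14.0])

# Upsample B onto A's timestamps with linear interpolation.
b_on_a = np.interp(t_a, t_b, b_values)

print(b_on_a)
```

After this step, `b_on_a` has one value per timestamp in `t_a`, so the two streams can be compared or combined sample by sample.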

Example of Upsampling in Python

Let’s consider a basic example demonstrating upsampling in Python using the scipy.interpolate module for linear interpolation:
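A minimal sketch of such an example, assuming a small made-up signal (five samples of y = x**2) and `scipy.interpolate.interp1d` with `kind="linear"`. The variable names and the signal are illustrative choices.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical original signal: 5 samples of y = x**2.
x = np.linspace(0, 4, 5)          # 0, 1, 2, 3, 4
y = x ** 2

# Build a linear interpolator from the original samples.
linear = interp1d(x, y, kind="linear")

# Evaluate it on a grid with twice the resolution.
x_up = np.linspace(0, 4, 9)       # 0, 0.5, 1, ..., 4
y_up = linear(x_up)

print(y_up)
```

The original samples are preserved, and each new point sits halfway between its two neighbours. Plotting `x_up` against `y_up` alongside the original points shows the denser signal.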

[Figures: data upsampling example in Python, plotting code, and the resulting upsampling plot]

How to Upsample Data in Python?

Check out the step-by-step instructions below to learn how to implement upsampling techniques in your Python projects. 

Step 1 - Import the library

    import numpy as np

    from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. The dataset is fairly balanced, so we deliberately make it imbalanced to illustrate the technique: we drop the first 30 rows, all of which belong to class 0, by keeping only the rows from index 30 onward. Then we merge classes 1 and 2 into a single class 1, leaving class 0 unchanged.

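    # Load the built-in wine dataset
    wine = datasets.load_wine()
    X = wine.data
    y = wine.target

    # Drop the first 30 rows (all class 0) to make the dataset imbalanced
    X = X[30:, :]
    y = y[30:]

    # Keep class 0 as 0 and merge every other class into 1
    y = np.where((y == 0), 0, 1)

    print("Viewing at the imbalanced target vector:\n", y)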
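To check how imbalanced the data becomes, the class counts can be inspected with `np.bincount`. This is a small self-contained sketch that repeats the slicing from this step; the names `counts_before`, `y_binary`, and `counts_after` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
y = wine.target[30:]                    # same slice as in the step above

counts_before = np.bincount(y)          # samples per original class (0, 1, 2)
y_binary = np.where(y == 0, 0, 1)       # class 0 vs. everything else
counts_after = np.bincount(y_binary)

print("per original class:", counts_before)
print("after relabeling:  ", counts_after)
```

After relabeling, class 0 has 29 samples and class 1 has 119, which is the imbalance the next step corrects.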

Step 3 - Upsampling the Dataset

First, we collect the row indices where the target value is 0 and where it is 1 into two separate arrays, and then print the number of observations in each.

    # Indices of the rows belonging to each class
    i_class0 = np.where(y == 0)[0]
    i_class1 = np.where(y == 1)[0]

    # Number of observations in each class
    s_class0 = len(i_class0)
    s_class1 = len(i_class1)

    print()
    print("s_class0: ", s_class0)
    print()
    print("s_class1: ", s_class1)

The output shows that samples with target value 1 (119) far outnumber those with target value 0 (29). So, in upsampling, we increase the number of samples in the smaller class. np.random.choice with replace=True draws indices from class 0 with replacement until there are as many as in class 1, so existing samples are duplicated rather than new ones created. Then we print the combined target vector for classes 0 and 1.

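    # Sample class 0 indices with replacement to match the size of class 1
    i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)

    # Join the upsampled class 0 targets with the class 1 targets
    print(np.hstack((y[i_class0_upsampled], y[i_class1])))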

So the output comes as:

Viewing at the imbalanced target vector:

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

 

s_class0:  29

 

s_class1:  119

 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
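The steps above print only the balanced target vector; a model also needs the matching feature rows. A minimal sketch of building both together, assuming the same wine setup as in Step 2. The names `idx`, `X_balanced`, and `y_balanced` are introduced here for illustration, and the seed is an arbitrary choice for repeatability.

```python
import numpy as np
from sklearn import datasets

# Recreate the imbalanced data from Step 2.
wine = datasets.load_wine()
X, y = wine.data[30:, :], wine.target[30:]
y = np.where(y == 0, 0, 1)

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Sample minority indices with replacement (seeded for repeatability).
rng = np.random.default_rng(42)
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Index features and labels with the same indices so they stay aligned.
idx = np.concatenate((i_class0_upsampled, i_class1))
X_balanced, y_balanced = X[idx], y[idx]

print(X_balanced.shape, np.bincount(y_balanced))
```

Indexing `X` and `y` with one shared index array keeps every duplicated label paired with its own feature row.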

Learn Upsampling in Python with ProjectPro! 

Learning how to upsample in Python is really important for dealing with uneven datasets. It's beneficial in fields like data science and big data analysis. Instead of just reading about it, getting hands-on practice through real projects is key. That's where ProjectPro comes in handy. It offers over 270 projects on data science, big data, and data engineering to give you practical experience. Working on these projects helps you work with real data and make better decisions. So, if you want to become skilled in data science, ProjectPro is the way to go. Check out the ProjectPro Repository today to become proficient in upsampling techniques. 
