How to Upsample Data in Python?

This tutorial explains how to upsample data in Python, with step-by-step instructions and practical examples.

Have you ever worked on a classification problem where most samples in the dataset belong to a single class? To balance such a dataset so that every class in the target has the same number of samples, we can upsample it. Upsampling means increasing the number of samples in the class that has fewer of them. This tutorial shows how to upsample data in Python.

What is Upsampling in Python? 

Let's consider a simple healthcare use case from data science: a data scientist is working on a medical dataset for diagnosing a rare disease. The positive cases (patients with the disease) are significantly outnumbered by the negative cases (patients without the disease).

  • Class 1: Patients with a rare disease (minority class), represented by 200 instances.

  • Class 0: Patients without the disease (majority class), represented by 2000 instances.

This dataset is imbalanced because it has far fewer positive cases than negative ones. The imbalance can skew the model's predictions towards the majority class and reduce its ability to identify the minority class accurately. If it is not addressed, the model may optimize overall accuracy at the expense of correctly identifying patients with the rare disease, because a model biased towards predicting negative cases still scores well on accuracy while misclassifying the patients who matter most.
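To make the risk concrete, here is a small arithmetic sketch using the class counts from the example above. The "model" is hypothetical: it is assumed to predict the negative class for every patient.

```python
# Class counts taken from the example above.
n_negative = 2000  # Class 0: patients without the disease
n_positive = 200   # Class 1: patients with the rare disease

# Hypothetical model that always predicts the majority class (0).
true_negatives = n_negative  # every healthy patient is labelled correctly
true_positives = 0           # no diseased patient is ever detected

accuracy = (true_positives + true_negatives) / (n_negative + n_positive)
recall = true_positives / n_positive  # share of diseased patients found

print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.909
print(f"recall:   {recall:.3f}")    # recall:   0.000
```

An accuracy of about 91% looks good on paper, yet the model never detects a single patient with the disease.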

 

To address challenges like these, data scientists often upsample the minority class (patients with rare diseases) by generating synthetic data points or duplicating existing instances. This helps ensure that the model learns from a more balanced distribution of classes, improving its ability to detect rare diseases accurately. 

 

Upsampling in Python refers to increasing the resolution or frequency of data points in a dataset. This technique is commonly used in signal processing, computer vision, and machine learning. In machine learning, upsampling is often employed to address class imbalance by artificially increasing the number of instances in the minority class. It involves replicating existing data points or generating synthetic data points to balance the distribution of classes in the dataset. Upsampling can be implemented using various Python libraries such as scikit-learn, TensorFlow, or PyTorch.
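As a minimal sketch of the replication approach with scikit-learn, the snippet below uses `sklearn.utils.resample` to duplicate minority rows. The feature values are randomly generated stand-ins (an assumption for illustration, not real patient data), and the class sizes are borrowed from the healthcare example.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in features (assumed for illustration only).
X_majority = rng.normal(size=(2000, 3))           # Class 0
X_minority = rng.normal(loc=1.0, size=(200, 3))   # Class 1

# Draw minority rows with replacement until both classes have the same size.
X_minority_up = resample(
    X_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=0,
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_up), dtype=int),
])

print(np.bincount(y_balanced))  # [2000 2000]
```

In practice it is usually safer to upsample only the training split, so that duplicated rows do not end up in both the training and the test data.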

Why Upsampling? 

Upsampling is crucial in signal processing for aligning multiple data streams that have different sampling rates. In real-world scenarios, data often arrives irregularly and out of order, which leads to noise and misalignment. Upsampling increases the sampling rate of the slower stream by adding new data points, so that the streams are synchronized and can be analyzed together. This is essential when combining data from diverse sources.
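The alignment idea can be sketched with `numpy.interp`. The two streams, their timestamps, and their values are made up for illustration: one is sampled once per second, the other four times per second.

```python
import numpy as np

# Two hypothetical sensor streams covering the same 4 seconds.
t_slow = np.array([0.0, 1.0, 2.0, 3.0, 4.0])        # 1 sample per second
slow = np.array([10.0, 12.0, 11.0, 15.0, 14.0])

t_fast = np.linspace(0.0, 4.0, 17)                  # 4 samples per second
fast = np.sin(t_fast)

# Upsample the slow stream onto the fast stream's timestamps
# using linear interpolation between neighbouring samples.
slow_upsampled = np.interp(t_fast, t_slow, slow)

# Both streams now share one time axis and can be analyzed together.
aligned = np.column_stack([t_fast, fast, slow_upsampled])
print(aligned.shape)  # (17, 3)
```

`numpy.interp` expects the original timestamps in increasing order, so out-of-order data would need to be sorted by time first.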

Example of Upsampling in Python

Let’s consider a basic example demonstrating upsampling in Python using the scipy.interpolate module for linear interpolation:
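A minimal sketch of such an example, assuming a small made-up signal of five points, could look like this. It uses `scipy.interpolate.interp1d` with `kind="linear"` and evaluates the interpolator on a denser grid.

```python
import numpy as np
from scipy.interpolate import interp1d

# Original signal: 5 samples (values are made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 2.0, 1.0, 3.0, 5.0])

# Build a linear interpolator from the original samples.
linear = interp1d(x, y, kind="linear")

# Evaluate it on a denser grid: 9 points instead of 5.
x_up = np.linspace(x.min(), x.max(), 9)
y_up = linear(x_up)

print(x_up)
print(y_up)
```

Plotting `(x, y)` and `(x_up, y_up)` on the same axes, for example with matplotlib, shows the new points falling on the straight lines between the original samples.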

[Figures: data upsampling example code, plotting code, and the resulting upsampling plot]

How to Upsample Data in Python?

Check out the step-by-step instructions below to learn how to implement upsampling techniques in your Python projects. 

Step 1 - Import the library

    import numpy as np

    from sklearn import datasets

We have imported numpy and datasets modules.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. The dataset is not strongly imbalanced, so we make it imbalanced on purpose to illustrate the technique: we drop the first 30 rows, all of which belong to class 0, by keeping only the rows from index 30 onward. Then we convert the target to binary, keeping class 0 as 0 and relabeling every other class as 1.

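    # Load the built-in wine dataset
    wine = datasets.load_wine()

    X = wine.data
    y = wine.target

    # Drop the first 30 rows to make the classes imbalanced
    X = X[30:, :]
    y = y[30:]

    # Binarize the target: class 0 stays 0, every other class becomes 1
    y = np.where((y == 0), 0, 1)

    print("Viewing at the imbalanced target vector:\n", y)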

Step 3 - Upsampling the Dataset

First, we collect the row indices where the target is 0 and where it is 1 into two separate arrays, and then print the number of observations in each.

    # Row indices of each class
    i_class0 = np.where(y == 0)[0]
    i_class1 = np.where(y == 1)[0]

    # Number of observations in each class
    s_class0 = len(i_class0)
    s_class1 = len(i_class1)
    print("\ns_class0: ", s_class0)
    print("\ns_class1: ", s_class1)

The output shows that samples with target value 1 far outnumber those with target value 0 (119 versus 29). Upsampling therefore increases the number of samples in the smaller class. Here, np.random.choice with replace=True draws row indices from class 0 with replacement until there are as many as in class 1, so the added rows are duplicates of existing samples rather than newly generated ones. We then print the combined target vector for classes 0 and 1.

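    # Sample class-0 indices with replacement until they match the class-1 count
    i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)

    # Join the upsampled class-0 labels with the class-1 labels
    print(np.hstack((y[i_class0_upsampled], y[i_class1])))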

So the output comes as:

Viewing at the imbalanced target vector:

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

 

s_class0:  29

 

s_class1:  119

 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
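The printed vector covers only the labels. To train a model on the balanced data, the same indices would also have to be applied to the feature matrix. The following is a sketch of that step, not part of the original recipe: it repeats the setup so that it runs on its own, and it swaps in a seeded `np.random.default_rng` generator (an assumption made for reproducibility) in place of `np.random.choice`.

```python
import numpy as np
from sklearn.datasets import load_wine

# Rebuild the imbalanced binary dataset from the steps above.
wine = load_wine()
X, y = wine.data[30:, :], wine.target[30:]
y = np.where(y == 0, 0, 1)

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Seeded generator so the resampled indices are reproducible.
rng = np.random.default_rng(42)
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Apply the same indices to features and labels so rows stay paired.
idx = np.concatenate([i_class0_upsampled, i_class1])
X_balanced, y_balanced = X[idx], y[idx]

print(X_balanced.shape)         # (238, 13)
print(np.bincount(y_balanced))  # [119 119]
```

Indexing X and y with one shared index array keeps every feature row attached to its label.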

Learn Upsampling in Python with ProjectPro! 

Learning how to upsample in Python is really important for dealing with uneven datasets. It's beneficial in fields like data science and big data analysis. Instead of just reading about it, getting hands-on practice through real projects is key. That's where ProjectPro comes in handy. It offers over 270 projects on data science, big data, and data engineering to give you practical experience. Working on these projects helps you work with real data and make better decisions. So, if you want to become skilled in data science, ProjectPro is the way to go. Check out the ProjectPro Repository today to become proficient in upsampling techniques. 
