How to Upsample Data in Python?

This tutorial explains how to upsample data in Python, with clear step-by-step instructions and practical examples.

Have you ever worked on a classification problem where most samples in the dataset belong to one class? To give every class in the target an equal number of samples, we can upsample the dataset. Upsampling means increasing the number of samples in the class (or classes) that have fewer of them. This tutorial shows how to upsample data in Python.

What is Upsampling in Python? 

Let's consider a simple healthcare use case: a data scientist is working on a medical dataset for diagnosing a rare disease. The positive cases (patients with the disease) are significantly outnumbered by the negative cases (patients without the disease).

  • Class 1: Patients with a rare disease (minority class), represented by 200 instances.

  • Class 0: Patients without the disease (majority class), represented by 2000 instances.

This dataset is imbalanced because it has far fewer positive cases than negative ones. The imbalance can skew the model's predictions towards the majority class and reduce its ability to identify the minority class. If it is not addressed, the model may optimize overall accuracy at the expense of correctly identifying patients with the rare disease, which creates a risk that sick patients are misclassified as healthy.
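To make that risk concrete, here is a minimal sketch using the 200/2000 split from the example. The "always predict negative" model is a hypothetical worst case chosen for illustration, not a real classifier.

```python
import numpy as np

# Labels matching the example: 200 positive (1) and 2000 negative (0) cases.
y_true = np.concatenate([np.ones(200, dtype=int), np.zeros(2000, dtype=int)])

# Hypothetical degenerate model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good because the majority class dominates.
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: share of true positives that were detected.
recall_positive = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy: {accuracy:.3f}")                # accuracy: 0.909
print(f"minority recall: {recall_positive:.3f}")  # minority recall: 0.000
```

The model scores about 91% accuracy while missing every patient with the disease, which is why accuracy alone is misleading on imbalanced data.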

 

To address challenges like these, data scientists often upsample the minority class (patients with the rare disease) by duplicating existing instances or generating synthetic data points. This helps the model learn from a more balanced distribution of classes, improving its ability to detect the rare disease.

 

More generally, upsampling refers to increasing the resolution or frequency of data points in a dataset. The technique is commonly used in signal processing, computer vision, and machine learning. In machine learning, upsampling is often used to address class imbalance by artificially increasing the number of instances in the minority class, either by replicating existing data points or by generating synthetic ones, so that the classes are evenly distributed. In Python, upsampling can be implemented with libraries such as scikit-learn, TensorFlow, or PyTorch.
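As one illustration of the scikit-learn route, the sketch below uses `sklearn.utils.resample` to replicate minority rows. The toy arrays and the seed are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy data (made up): 20 majority rows and 5 minority rows, 3 features each.
X_major = rng.normal(size=(20, 3))
X_minor = rng.normal(size=(5, 3))

# Draw minority rows with replacement until both classes have the same size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# Stack the two classes and build the matching label vector.
X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([
    np.zeros(len(X_major), dtype=int),
    np.ones(len(X_minor_up), dtype=int),
])

print(X_balanced.shape)                  # (40, 3)
print(np.bincount(y_balanced).tolist())  # [20, 20]
```

Because the added rows are exact copies, it is usually safer to upsample only the training split, after the train/test split has been made, so that duplicates of a row cannot end up on both sides.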

Why Upsampling? 

In signal processing, upsampling is crucial for aligning multiple data streams that have different sampling rates. In real-world scenarios, data often arrives irregularly and out of order, which leads to noise and misalignment. Upsampling increases the sampling rate of the slower streams by adding new data points, so that all streams share the same time grid. This alignment is essential for combining data from diverse sources and makes analysis and interpretation more reliable.
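A minimal sketch of this kind of alignment, assuming two hypothetical sensor streams with made-up values. It uses NumPy's `np.interp` for linear interpolation, which is one simple choice among many.

```python
import numpy as np

# Two hypothetical sensor streams covering the same 4-second window.
t_slow = np.array([0.0, 2.0, 4.0])            # sampled every 2 s
slow = np.array([10.0, 14.0, 12.0])
t_fast = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # sampled every 1 s
fast = np.array([1.0, 1.5, 2.0, 2.5, 3.0])

# Upsample the slow stream onto the fast stream's timestamps
# by linear interpolation between its known samples.
slow_up = np.interp(t_fast, t_slow, slow)

# Both streams now share one time grid and can be combined row by row.
aligned = np.column_stack([t_fast, fast, slow_up])

print(slow_up)        # [10. 12. 14. 13. 12.]
print(aligned.shape)  # (5, 3)
```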

Example of Upsampling in Python

Let’s consider a basic example demonstrating upsampling in Python using the scipy.interpolate module for linear interpolation:

[Figures: data upsampling example in Python; plotting the graph; resulting upsampling example plot]
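A minimal sketch of such an example, assuming SciPy is installed. The sample values and the upsampling factor are made up for illustration, and `interp1d` is only one of the interpolators that `scipy.interpolate` offers.

```python
import numpy as np
from scipy.interpolate import interp1d

# Original signal: 5 samples at t = 0, 1, 2, 3, 4 (made-up values).
t = np.arange(5)
values = np.array([0.0, 2.0, 4.0, 2.0, 0.0])

# Build a linear interpolator over the known samples.
linear = interp1d(t, values, kind="linear")

# Upsample by a factor of 2: 9 evenly spaced points over the same range.
t_up = np.linspace(0, 4, 9)
values_up = linear(t_up)

print(len(values), "->", len(values_up))
print(values_up)
```

Each new point lies halfway between two original samples, so its value is the average of its two neighbours.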

How to Upsample Data in Python?

Check out the step-by-step instructions below to learn how to implement upsampling techniques in your Python projects. 

Step 1 - Import the libraries

    import numpy as np
    from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. The dataset is not heavily imbalanced, so we make it imbalanced on purpose to see the effect of upsampling: we drop the first 30 rows, all of which belong to class 0, by keeping only the rows from index 30 onwards. Then we turn the problem into a binary one by keeping class 0 as 0 and relabelling every other class as 1.

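    # Load the built-in wine dataset
    wine = datasets.load_wine()

    X = wine.data
    y = wine.target

    # Drop the first 30 rows (all class 0) to create an imbalance
    X = X[30:, :]
    y = y[30:]

    # Keep class 0 as 0 and relabel every other class as 1
    y = np.where((y == 0), 0, 1)
    print("Viewing at the imbalanced target vector:\n", y)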
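To check what this step does to the class balance, the following sketch counts the samples per class before and after the change. It reloads the dataset so that it runs on its own; `y_full` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
y_full = wine.target

# Samples per class in the original three-class target.
print(np.bincount(y_full).tolist())  # [59, 71, 48]

# Drop the first 30 rows, then merge classes 1 and 2 into a single class 1.
y = y_full[30:]
y = np.where(y == 0, 0, 1)

# Samples per class in the binary, deliberately imbalanced target.
print(np.bincount(y).tolist())       # [29, 119]
```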

Step 3 - Upsampling the Dataset

First, we collect the row indices where the target is 0 and where it is 1 into two separate arrays, and then print the number of observations in each.

    # Row indices of each class
    i_class0 = np.where(y == 0)[0]
    i_class1 = np.where(y == 1)[0]

    # Number of observations in each class
    s_class0 = len(i_class0)
    print()
    print("s_class0: ", s_class0)

    s_class1 = len(i_class1)
    print()
    print("s_class1: ", s_class1)

The output shows that class 1 has far more samples (119) than class 0 (29). To upsample, we increase the number of samples in the smaller class: np.random.choice draws s_class1 indices from i_class0 with replacement (replace=True), so existing class 0 rows are repeated until both classes are the same size. No new synthetic rows are created. Finally, we print the combined target vector, which now contains an equal number of 0s and 1s.

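    # Sample class 0 indices with replacement until they match class 1 in number
    i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)

    # Join the upsampled class 0 targets with the class 1 targets
    print(np.hstack((y[i_class0_upsampled], y[i_class1])))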

The output is:

Viewing at the imbalanced target vector:

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

 

s_class0:  29

 

s_class1:  119

 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
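The steps above print only the balanced target vector. To train a model, the same indices also have to be applied to the features. The sketch below shows one way to do that. The names `rng`, `idx`, `X_balanced`, and `y_balanced` and the seed value are assumptions added here, and a seeded generator replaces np.random.choice so the result is reproducible.

```python
import numpy as np
from sklearn import datasets

# Rebuild the imbalanced binary dataset from the earlier steps.
wine = datasets.load_wine()
X = wine.data[30:, :]
y = np.where(wine.target[30:] == 0, 0, 1)

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Apply the same indices to features and labels so rows stay paired.
idx = np.concatenate([i_class0_upsampled, i_class1])
X_balanced = X[idx]
y_balanced = y[idx]

print(X_balanced.shape)                  # (238, 13)
print(np.bincount(y_balanced).tolist())  # [119, 119]
```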

Learn Upsampling in Python with ProjectPro! 

Learning how to upsample in Python is really important for dealing with uneven datasets. It's beneficial in fields like data science and big data analysis. Instead of just reading about it, getting hands-on practice through real projects is key. That's where ProjectPro comes in handy. It offers over 270 projects on data science, big data, and data engineering to give you practical experience. Working on these projects helps you work with real data and make better decisions. So, if you want to become skilled in data science, ProjectPro is the way to go. Check out the ProjectPro Repository today to become proficient in upsampling techniques. 
