How to Upsample Data in Python?

This tutorial will help you understand how to upsample data in Python, offering clear step-by-step instructions and practical examples.
Last Updated: 03 Apr 2024

Have you ever worked on a classification problem where most of the samples in the dataset belong to one class? To give each class an equal number of samples in the target variable, we can upsample the dataset. Upsampling means increasing the number of samples in the class that has fewer of them (the minority class). Check out this tutorial to understand how to upsample data in Python.

What is Upsampling in Python?
Why Upsampling?
Example of Upsampling in Python
How to Upsample Data in Python?
Learn Upsampling in Python with ProjectPro!

What is Upsampling in Python?

Let's consider a simple example of a data science healthcare use case: a data scientist is working on a medical dataset for diagnosing a rare disease. The positive cases (patients with the disease) are significantly outnumbered by the negative cases (patients without the disease).

Class 1: Patients with a rare disease (minority class), represented by 200 instances.
Class 0: Patients without the disease (majority class), represented by 2000 instances.

This dataset is imbalanced because it has far fewer positive cases than negative ones. The imbalance can skew the model's predictions towards the majority class and reduce its ability to identify the minority class accurately. If the imbalance is not addressed, the model may optimize overall accuracy at the expense of correctly identifying patients with the rare disease. For example, a model that always predicts "no disease" would be about 91% accurate here (2000 out of 2200 cases) while missing every positive case.

To address challenges like these, data scientists often upsample the minority class (patients with rare diseases) by generating synthetic data points or duplicating existing instances. This helps ensure that the model learns from a more balanced distribution of classes, improving its ability to detect rare diseases accurately.
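The duplication approach described above can be sketched with scikit-learn's resample utility. This is a minimal illustration, not the method used later in this tutorial: the feature values are randomly generated placeholders, and the names X_neg, X_pos, X_balanced, and y_balanced are assumptions chosen for the example. Only the class sizes (2000 and 200) come from the scenario above.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical data: 2000 negative and 200 positive patients, 5 features each.
X_neg = rng.normal(size=(2000, 5))
X_pos = rng.normal(loc=1.0, size=(200, 5))

# Draw minority rows with replacement until they match the majority count.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

# Stack both classes and build the matching label vector.
X_balanced = np.vstack([X_neg, X_pos_up])
y_balanced = np.concatenate([
    np.zeros(len(X_neg), dtype=int),
    np.ones(len(X_pos_up), dtype=int),
])

print(X_balanced.shape)  # (4000, 5)
```

In practice, upsampling is usually applied only to the training split, so that duplicated rows do not also appear in the test data.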

Upsampling in Python refers to increasing the resolution or frequency of data points in a dataset. This technique is commonly used in signal processing, computer vision, and machine learning. In machine learning, upsampling is often employed to address class imbalance by artificially increasing the number of instances in the minority class. It involves replicating existing data points or generating synthetic data points to balance the distribution of classes in the dataset. Upsampling can be implemented using various Python libraries such as scikit-learn, TensorFlow, or PyTorch.

Why Upsampling?

Upsampling is also important in signal processing, where it is used to align multiple data streams that have different sampling rates. In real-world scenarios, data often arrives irregularly and out of order, which introduces noise and misalignment. Upsampling increases the sampling rate of the slower stream by adding new data points, typically through interpolation, so that the streams share a common time grid. This alignment is essential for combining data from diverse sources and analyzing it together.
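As a rough sketch of aligning two streams, the snippet below uses pandas to upsample a slower series onto the time grid of a faster one. The readings, timestamps, and the names slow, fast, and aligned are made up for illustration, and linear interpolation is only one of several possible ways to fill the new points.

```python
import pandas as pd

# Hypothetical sensor readings: one stream every 2 minutes, one every minute.
slow = pd.Series(
    [10.0, 14.0, 12.0],
    index=pd.date_range("2024-01-01 00:00", periods=3, freq="2min"),
)
fast = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2024-01-01 00:00", periods=5, freq="1min"),
)

# Upsample the slow stream to a 1-minute grid. The new timestamps start as NaN
# and are then filled by linear interpolation between the known readings.
slow_up = slow.resample("1min").asfreq().interpolate(method="linear")

# Both streams now share the same index and can be combined row by row.
aligned = pd.DataFrame({"slow": slow_up, "fast": fast})
print(aligned)
```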

Example of Upsampling in Python

Let’s consider a basic example demonstrating upsampling in Python using the scipy.interpolate module for linear interpolation:
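A minimal sketch of such an example follows, assuming a small made-up signal of five samples. It uses scipy.interpolate.interp1d with kind="linear"; the sample values and the choice to halve the sample spacing are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import interp1d

# Original signal: 5 samples at x = 0, 1, 2, 3, 4 (made-up values).
x = np.arange(5)
y = np.array([0.0, 2.0, 4.0, 2.0, 0.0])

# Linear interpolator built from the known samples.
f = interp1d(x, y, kind="linear")

# Halve the sample spacing: 9 evenly spaced points over the same range.
x_up = np.linspace(x[0], x[-1], num=2 * len(x) - 1)
y_up = f(x_up)

print(len(y), "->", len(y_up))  # 5 -> 9
print(y_up)
```

Each new point lies on the straight line between its two neighbouring original samples, so the original values are preserved at their original positions.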

How to Upsample Data in Python?

Check out the step-by-step instructions below to learn how to implement upsampling techniques in your Python projects.

Step 1 - Import the library

import numpy as np
from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. The dataset has three classes (0, 1, and 2) and is not heavily imbalanced, so we make it imbalanced to illustrate upsampling: we drop the first 30 rows, all of which belong to class 0, by keeping only the rows from index 30 onward. Then we merge classes 1 and 2 into a single class 1, leaving class 0 as the minority class.

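wine = datasets.load_wine()
X = wine.data
y = wine.target

# Drop the first 30 rows (all class 0) to create an imbalance
X = X[30:, :]
y = y[30:]

# Keep class 0 as 0 and relabel classes 1 and 2 as 1
y = np.where((y == 0), 0, 1)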

Step 3 - Upsampling the Dataset

First, we collect the row indices of class 0 and class 1 in two separate arrays and print the number of observations in each.

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

s_class0 = len(i_class0)
s_class1 = len(i_class1)
print("s_class0:", s_class0)
print("s_class1:", s_class1)

The output shows that class 1 has far more samples than class 0. To upsample, we draw row indices from the minority class (class 0) at random with replacement until it has as many samples as the majority class. Note that np.random.choice duplicates existing rows rather than generating new synthetic samples. Finally, we stack the upsampled class 0 targets with the class 1 targets and print the combined target vector.

i_class0_upsampled = np.random.choice(i_class0, size=s_class1, replace=True)

print(np.hstack((y[i_class0_upsampled], y[i_class1])))

So the output comes as:

Viewing at the imbalanced target vector:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

s_class0: 29

s_class1: 119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
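The steps above print only the upsampled target vector. As a hedged extension, the sketch below applies the same sampled indices to the feature matrix as well, so that features and labels stay paired. The names rng, idx, X_balanced, and y_balanced and the fixed seed are assumptions added for this example and are not part of the original recipe.

```python
import numpy as np
from sklearn import datasets

rng = np.random.default_rng(42)  # seeded only to make the sketch repeatable

# Same setup as Step 2: drop the first 30 rows, merge classes 1 and 2.
wine = datasets.load_wine()
X, y = wine.data[30:], wine.target[30:]
y = np.where(y == 0, 0, 1)

i_class0 = np.where(y == 0)[0]
i_class1 = np.where(y == 1)[0]

# Sample minority-class row indices with replacement up to the majority size.
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Apply the same indices to X and y so features and labels stay paired.
idx = np.concatenate([i_class0_upsampled, i_class1])
X_balanced, y_balanced = X[idx], y[idx]

print(X_balanced.shape)  # (238, 13)
print("class 0:", int((y_balanced == 0).sum()), "class 1:", int((y_balanced == 1).sum()))
```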

Learn Upsampling in Python with ProjectPro!

Learning how to upsample in Python is really important for dealing with uneven datasets. It's beneficial in fields like data science and big data analysis. Instead of just reading about it, getting hands-on practice through real projects is key. That's where ProjectPro comes in handy. It offers over 270 projects on data science, big data, and data engineering to give you practical experience. Working on these projects helps you work with real data and make better decisions. So, if you want to become skilled in data science, ProjectPro is the way to go. Check out the ProjectPro Repository today to become proficient in upsampling techniques.
