How to Downsample Data in Python?

This tutorial will help you learn all about data downsampling in Python. Simplify large datasets without sacrificing valuable information. | ProjectPro
Last Updated: 02 Apr 2024

Get access to Data Science projects View all Data Science projects

DATA MUNGING DATA CLEANING PYTHON MACHINE LEARNING RECIPES PANDAS CHEATSHEET ALL TAGS

Have you ever encountered a biased dataset that contains most samples of a particular class while working on classification problems? To transform the dataset so that it includes an equal number of classes in the target value, we can downsample it. Downsampling means reducing the number of samples that have a bias class. So, check out this tutorial to understand how to downsample data in Python.

What is Downsampling in Python?
Why Downsampling?
What are the Methods of Downsampling in Python?
Example of Downsampling in Python
How to Downsample Data in Python?
Downsampling in Python - Considerations and Challenges
Learn Downsampling in Python with ProjectPro!

What is Downsampling in Python?

Data scientists usually downsample data to address imbalances in a dataset, particularly in scenarios where one class or category is significantly overrepresented. Downsampling involves reducing the size of the majority class to achieve a more balanced distribution, which can lead to better model performance and prevent biased predictions.

For example, let’s consider a dataset for predicting customer churn in a subscription-based service like ProjectPro where only 10% of customers churned while the remaining 90% did not. If a model is trained on this imbalanced dataset, it may inaccurately prioritize the majority class, leading to poor predictive performance for the minority class of churned customers.

To address this, data scientists can downsample the majority class by randomly selecting a subset of observations from it to match the size of the minority class. This ensures that the model receives equal representation from each class, allowing it to learn patterns and make predictions more effectively.

Downsampling in Python involves reducing data volume for various purposes, such as analysis, storage, or preprocessing. This process is crucial for managing computational resources efficiently and minimizing noise within datasets. In applications like time series data, downsampling allows for alignment with specific problem-solving needs by adjusting granularities.

Why Downsampling?

The primary motivation behind downsampling is to alleviate computational demands, mitigate storage constraints, and potentially diminish noise within datasets. This process proves particularly beneficial when the original dataset size could be more robust or when aligning data granularity with analytical requirements.

What are the Methods of Downsampling in Python?

Python offers multiple methodologies for downsampling datasets:

Average Pooling: This technique involves averaging data points to derive a representative single point. For instance, averaging data points within each minute in time series data could reduce granularity from seconds to minutes, streamlining analysis.

Decimation: In decimation, specific data points are discarded without replacements. For instance, retaining every nth data point while discarding the rest can significantly reduce dataset size.
Reservoir Sampling is a randomized algorithm for selecting a representative sample from an unknown dataset. It guarantees that each possible subset has an equal probability of being selected, making it useful when the dataset size exceeds memory capacity.

Example of Downsampling in Python

Let's consider an example where we have a time series dataset containing measurements taken every second, and we want to downsample it to reduce the granularity to minutes using average pooling.

Python Downsampling Example

This example uses the pandas library to work with time series data. We first create a DataFrame df with timestamps ranging from '2024-01-01' to '2024-01-02' with a frequency of one second. Then, we set the timestamp column as the index of the DataFrame. Finally, we use the resample() method to downsample the data to minutes ('T' stands for minutes) and calculate the mean value for each minute.

This process reduces the number of data points from 86,400 (one per second) to 1,440 (one per minute), making the data more accessible to work with while still preserving some meaningful information.

How to Downsample Data in Python?

Here is a step-by-step journey through understanding the process of downsampling data in Python -

Step 1 - Import the library

import numpy as np

from sklearn import datasets

We have imported numpy and datasets modules.

Step 2 - Setting up the Data

We have imported the inbuilt wine dataset from the datasets module and stored the data in x and the target in y. This dataset is not biased, so we are making it biased to understand the functions better; we have removed the first 30 rows by selecting the rows after the 30 rows. Then, we changed the classes from 0 to 1 in the selected data.

wine = datasets.load_wine()

X = wine.data

y = wine.target

X = X[30:,:]

y = y[30:]

y = np.where((y == 0), 0, 1)

print("Viewing the imbalanced target vector:\n", y)

Explore More Data Science and Machine Learning Projects for Practice. Fast-Track Your Career Transition with ProjectPro

Step 3 - Downsampling the dataset

First, we select the rows where target values are 0 and 1 in two different objects and then print the number of observations in the two objects.

w_class0 = np.where(y == 0)[0]

w_class1 = np.where(y == 1)[0]

n_class0 = len(w_class0)

n_class1 = len(w_class1)

print("n_class0: ", n_class0)

print("n_class1: ", n_class1)

In the output, we will see that the number of samples having target values of 1 is much greater than 0. So, in downsampling, we will randomly select the number of rows having target values of 1 and make it equal to the number of rows having target values of 0. Then, we will print the joint dataset having target classes of 0 and 1.

w_class1_downsampled = np.random.choice(w_class1, size=n_class0, replace=False)

print(); print(np.hstack((y[w_class0], y[w_class1_downsampled])))

So the output comes as:

Viewing the imbalanced target vector:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

n_class0: 29

n_class1: 119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Downsampling in Python - Considerations and Challenges

While downsampling offers computational advantages, it necessitates careful handling. Improper downsampling techniques can lead to loss of critical information, potentially skewing analytical outcomes or model predictions. It's essential to balance data reduction and preservation of essential features.

Moreover, downsampling should be distinct from data compression techniques, which aim to reduce storage requirements without sacrificing information integrity. Techniques like encoding and quantization are commonly employed in data compression, unlike downsampling, which focuses on reducing dataset size while retaining relevant information.

Learn Downsampling in Python with ProjectPro!

Downsampling data in Python is valuable for any data scientist or analyst seeking to handle large datasets while preserving important information efficiently. This guide explored various downsampling techniques and their implementation using Python libraries such as Pandas and Scikit-learn. Whether you're working with time-series data, images, or any other form of data, understanding downsampling methods empowers you to streamline your analysis, improve computational efficiency, and derive meaningful insights. However, theoretical knowledge alone is insufficient. Practical experience through real-world projects is paramount in solidifying understanding and honing expertise. With ProjectPro's comprehensive repository of over 270+ projects spanning data science and big data domains, aspiring professionals can apply their knowledge in practical scenarios.

Download Materials

iPython Notebook

What Users are saying..

Ed Godalle

Director Data Analytics at EY / EY Tech

I am the Director of Data Analytics with over 10+ years of IT experience. I have a background in SQL, Python, and Big Data working with Accenture, IBM, and Infosys. I am looking to enhance my skills... Read More

Relevant Projects

Machine Learning Projects

Data Science Projects

Python Projects for Data Science

Data Science Projects in R

Machine Learning Projects for Beginners

Deep Learning Projects

Neural Network Projects

Tensorflow Projects

NLP Projects

Kaggle Projects

IoT Projects

Big Data Projects

Hadoop Real-Time Projects Examples

Spark Projects

Data Analytics Projects for Students

Relevant Projects

Forecasting Business KPI's with Tensorflow and Python

In this machine learning project, you will use the video clip of an IPL match played between CSK and RCB to forecast key performance indicators like the number of appearances of a brand logo, the frames, and the shortest and longest area percentage in the video.

View Project Details

Build a Logistic Regression Model in Python from Scratch

Regression project to implement logistic regression in python from scratch on streaming app data.

View Project Details

Deep Learning Project for Time Series Forecasting in Python

Deep Learning for Time Series Forecasting in Python -A Hands-On Approach to Build Deep Learning Models (MLP, CNN, LSTM, and a Hybrid Model CNN-LSTM) on Time Series Data.

View Project Details

AWS MLOps Project for ARCH and GARCH Time Series Models

Build and deploy ARCH and GARCH time series forecasting models in Python on AWS .

View Project Details

Deep Learning Project for Beginners with Source Code Part 1

Learn to implement deep neural networks in Python .

View Project Details

Ecommerce product reviews - Pairwise ranking and sentiment analysis

This project analyzes a dataset containing ecommerce product reviews. The goal is to use machine learning models to perform sentiment analysis on product reviews and rank them based on relevance. Reviews play a key role in product recommendation systems.

View Project Details

How to Downsample Data in Python?

Table of Contents

What is Downsampling in Python?

Why Downsampling?

What are the Methods of Downsampling in Python?

Example of Downsampling in Python

How to Downsample Data in Python?

Step 1 - Import the library

Step 2 - Setting up the Data

Step 3 - Downsampling the dataset

Downsampling in Python - Considerations and Challenges

Learn Downsampling in Python with ProjectPro!

Ed Godalle

Relevant Projects

You might also like

Relevant Projects