How to Downsample Data in Python?

This tutorial will help you learn all about data downsampling in Python. Simplify large datasets without sacrificing valuable information.

Have you ever worked on a classification problem and found that most of the samples in your dataset belong to one class? To transform the dataset so that each class in the target variable has an equal number of samples, we can downsample it. Downsampling means reducing the number of samples in the overrepresented (majority) class. So, check out this tutorial to understand how to downsample data in Python.

What is Downsampling in Python? 

Data scientists usually downsample data to address imbalances in a dataset, particularly in scenarios where one class or category is significantly overrepresented. Downsampling involves reducing the size of the majority class to achieve a more balanced distribution, which can lead to better model performance and prevent biased predictions.

For example, let’s consider a dataset for predicting customer churn in a subscription-based service like ProjectPro where only 10% of customers churned while the remaining 90% did not. If a model is trained on this imbalanced dataset, it may inaccurately prioritize the majority class, leading to poor predictive performance for the minority class of churned customers.

To address this, data scientists can downsample the majority class by randomly selecting a subset of observations from it to match the size of the minority class. This ensures that the model receives equal representation from each class, allowing it to learn patterns and make predictions more effectively.
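As a rough illustration of this idea, the sketch below builds a hypothetical toy table with the same 10%/90% split and balances it with pandas. The column names `customer_id` and `churned`, the 100-row size, and the `random_state` value are assumptions made for the example; they do not come from a real dataset.

```python
import pandas as pd

# Hypothetical toy data: 100 customers, 10 churned (1) and 90 retained (0).
customers = pd.DataFrame({
    "customer_id": range(100),
    "churned": [1] * 10 + [0] * 90,
})

minority = customers[customers["churned"] == 1]
majority = customers[customers["churned"] == 0]

# Randomly keep as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine both classes and shuffle the rows.
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)

print(balanced["churned"].value_counts())
```

Fixing `random_state` makes the selection reproducible; dropping it gives a different random subset on each run.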

Downsampling in Python involves reducing data volume for various purposes, such as analysis, storage, or preprocessing. This process is crucial for managing computational resources efficiently and minimizing noise within datasets. In applications like time series data, downsampling allows for alignment with specific problem-solving needs by adjusting granularities. 

Why Downsampling?

The primary motivation behind downsampling is to alleviate computational demands, mitigate storage constraints, and potentially diminish noise within datasets. This process proves particularly beneficial when the original dataset is larger than the analysis needs or the available resources can handle, or when the data granularity must be aligned with analytical requirements.

What are the Methods of Downsampling in Python? 

Python offers multiple methodologies for downsampling datasets:

  • Average Pooling: This technique replaces a group of data points with their average, so that one representative point stands in for the group. For instance, averaging the data points within each minute of a time series reduces the granularity from seconds to minutes, streamlining analysis.

  • Decimation: In decimation, data points are simply discarded rather than summarized. For instance, retaining every nth data point while discarding the rest can significantly reduce dataset size.

  • Reservoir Sampling: A randomized algorithm for selecting a fixed-size random sample from a dataset or stream whose total size is not known in advance. It guarantees that every item has an equal probability of being selected, making it useful when the dataset is too large to fit in memory.
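The last two methods can be sketched with the standard library alone. The function names `decimate` and `reservoir_sample` are hypothetical helpers written for this illustration, and the reservoir sampler follows the classic single-pass approach (often called Algorithm R); treat it as a minimal sketch rather than a tuned implementation.

```python
import random

def decimate(values, step):
    """Keep every `step`-th element, starting with the first one."""
    return values[::step]

def reservoir_sample(stream, k, seed=None):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

readings = list(range(100))
every_tenth = decimate(readings, 10)
sample = reservoir_sample(iter(readings), k=5, seed=0)

print(every_tenth)  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(sample)
```

Because the sampler only ever holds `k` items, it can process a stream that does not fit in memory.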

Example of Downsampling in Python 

Let's consider an example where we have a time series dataset containing measurements taken every second, and we want to downsample it to reduce the granularity to minutes using average pooling.

Python Downsampling Example

This example uses the pandas library to work with time series data. We first create a DataFrame df with one timestamp per second covering the whole of 2024-01-01, from midnight up to, but not including, midnight on 2024-01-02. Then, we set the timestamp column as the index of the DataFrame. Finally, we use the resample() method to downsample the data to minutes and calculate the mean value for each minute. The alias 'T' stands for minutes; recent pandas releases deprecate it in favor of 'min'.

This process reduces the number of data points from 86,400 (one per second) to 1,440 (one per minute), making the data easier to work with while still preserving the per-minute trend.
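A minimal sketch of the steps described above might look like the following. The column names `timestamp` and `value` and the random measurements are assumptions made for illustration. The range is built with `periods=86_400` so that it contains exactly one day of seconds, and the frequency string `"1min"` is used because it works in both older and newer pandas versions.

```python
import numpy as np
import pandas as pd

# One timestamp per second for a single day: 24 * 60 * 60 = 86,400 rows.
timestamps = pd.date_range("2024-01-01", periods=86_400, freq="s")

# Synthetic measurements (an assumption; any numeric column would do).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": timestamps,
    "value": rng.normal(size=len(timestamps)),
})

# resample() needs a datetime index.
df = df.set_index("timestamp")

# Average pooling: one mean value per minute.
per_minute = df.resample("1min").mean()

print(len(df), "->", len(per_minute))  # 86400 -> 1440
```

Swapping `.mean()` for `.first()` would keep the first reading in each minute instead of averaging, which amounts to decimation.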

How to Downsample Data in Python? 

Here is a step-by-step walkthrough of downsampling data in Python:

Step 1 - Import the libraries

    import numpy as np
    from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. This dataset is not heavily imbalanced, so we make it imbalanced to illustrate the technique: we drop the first 30 rows by keeping only the rows from index 30 onward. Then we relabel the target so that class 0 stays 0 and every other class becomes 1, which leaves a binary target with far more 1s than 0s.

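    # Load the wine dataset: features in X, class labels (0, 1, 2) in y
    wine = datasets.load_wine()
    X = wine.data
    y = wine.target

    # Drop the first 30 rows, all of which belong to class 0
    X = X[30:, :]
    y = y[30:]

    # Keep class 0 as 0 and relabel every other class as 1
    y = np.where((y == 0), 0, 1)
    print("Viewing the imbalanced target vector:\n", y)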

Step 3 - Downsampling the dataset

First, we find the row indices where the target value is 0 and where it is 1, store them in two separate arrays, and then print the number of observations in each.

    # Row indices of each class
    w_class0 = np.where(y == 0)[0]
    w_class1 = np.where(y == 1)[0]

    # Number of observations in each class
    n_class0 = len(w_class0)
    n_class1 = len(w_class1)

    print("n_class0: ", n_class0)
    print("n_class1: ", n_class1)

In the output, we will see that the number of samples with a target value of 1 is much greater than the number with a target value of 0. So, to downsample, we randomly select, without replacement, as many class-1 rows as there are class-0 rows. Then we print the combined target vector containing both classes.

    # Randomly pick n_class0 indices from class 1, without replacement
    w_class1_downsampled = np.random.choice(w_class1, size=n_class0, replace=False)

    # Join the class-0 labels with the downsampled class-1 labels
    print()
    print(np.hstack((y[w_class0], y[w_class1_downsampled])))

So the output is:

Viewing the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
n_class0:  29
n_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
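The steps above print only the balanced target vector. To train a model, the same row indices would also have to be applied to the feature matrix. The sketch below shows one possible way to do that; the helper name `downsample_to_smallest_class` is hypothetical, and the data is a synthetic stand-in that assumes the same class counts as the tutorial (29 versus 119), so it runs without scikit-learn.

```python
import numpy as np

def downsample_to_smallest_class(X, y, seed=None):
    """Randomly reduce every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Pick n_min row indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Synthetic stand-in: 29 rows of class 0, 119 rows of class 1, 2 features.
y_demo = np.array([0] * 29 + [1] * 119)
X_demo = np.arange(len(y_demo) * 2).reshape(-1, 2)

X_bal, y_bal = downsample_to_smallest_class(X_demo, y_demo, seed=0)

print(X_bal.shape)         # (58, 2)
print(np.bincount(y_bal))  # [29 29]
```

Indexing X and y with the same index array keeps each feature row paired with its label.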

Downsampling in Python - Considerations and Challenges

While downsampling offers computational advantages, it necessitates careful handling. Improper downsampling techniques can lead to loss of critical information, potentially skewing analytical outcomes or model predictions. It's essential to balance data reduction and preservation of essential features.

Moreover, downsampling should be distinguished from data compression. Compression techniques, such as encoding and quantization, aim to reduce storage requirements while keeping the original data recoverable, either exactly or approximately. Downsampling instead reduces the number of samples in the dataset, and the discarded samples cannot be recovered afterward.

Learn Downsampling in Python with ProjectPro! 

Downsampling data in Python is a valuable skill for any data scientist or analyst who needs to handle large or imbalanced datasets efficiently while preserving important information. This guide explored several downsampling techniques and their implementation with Python libraries such as pandas and scikit-learn. Whether you're working with time-series data, images, or any other form of data, understanding downsampling methods helps you streamline your analysis, improve computational efficiency, and derive meaningful insights. However, theoretical knowledge alone is insufficient. Practical experience through real-world projects is essential for solidifying understanding and building expertise. With ProjectPro's repository of more than 270 projects spanning data science and big data domains, aspiring professionals can apply their knowledge in practical scenarios.
