How to Downsample Data in Python?

This tutorial will help you learn all about data downsampling in Python, so you can simplify large datasets without sacrificing valuable information.

Have you ever worked on a classification problem where most samples in the dataset belong to one class? To transform such a dataset so that each class in the target has an equal number of samples, we can downsample it. Downsampling means reducing the number of samples in the overrepresented (majority) class. This tutorial explains how to downsample data in Python.

What is Downsampling in Python? 

Data scientists usually downsample data to address imbalances in a dataset, particularly in scenarios where one class or category is significantly overrepresented. Downsampling involves reducing the size of the majority class to achieve a more balanced distribution, which can lead to better model performance and prevent biased predictions.

For example, let’s consider a dataset for predicting customer churn in a subscription-based service like ProjectPro where only 10% of customers churned while the remaining 90% did not. If a model is trained on this imbalanced dataset, it may inaccurately prioritize the majority class, leading to poor predictive performance for the minority class of churned customers.

To address this, data scientists can downsample the majority class by randomly selecting a subset of observations from it to match the size of the minority class. This ensures that the model receives equal representation from each class, allowing it to learn patterns and make predictions more effectively.
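The churn scenario above can be sketched with pandas. This is a minimal illustration under stated assumptions: the `customers` DataFrame, its column names, the 1,000-row size, and the fixed seed are all hypothetical and not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical churn data: 1,000 customers, 10% churned (label 1).
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "monthly_usage_hours": rng.normal(20, 5, size=1000),
    "churned": np.repeat([1, 0], [100, 900]),
})

minority = customers[customers["churned"] == 1]
majority = customers[customers["churned"] == 0]

# Randomly keep as many majority rows as there are minority rows.
majority_downsampled = majority.sample(n=len(minority), random_state=42)

# Combine both classes and shuffle so the rows are interleaved.
balanced = pd.concat([minority, majority_downsampled]).sample(frac=1, random_state=42)

print(balanced["churned"].value_counts())
```

After this step both classes contain 100 rows each, at the cost of discarding 800 majority-class rows.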

Downsampling in Python involves reducing data volume for various purposes, such as analysis, storage, or preprocessing. This process is crucial for managing computational resources efficiently and minimizing noise within datasets. In applications like time series data, downsampling allows for alignment with specific problem-solving needs by adjusting granularities. 

Why Downsampling?

The primary motivation behind downsampling is to alleviate computational demands, mitigate storage constraints, and potentially diminish noise within datasets. It is particularly beneficial when the original dataset is too large to process comfortably or when the data granularity needs to be aligned with analytical requirements.

What are the Methods of Downsampling in Python? 

Python offers multiple methodologies for downsampling datasets:

  • Average Pooling: This technique involves averaging data points to derive a representative single point. For instance, averaging data points within each minute in time series data could reduce granularity from seconds to minutes, streamlining analysis.

  • Decimation: In decimation, specific data points are discarded and not replaced. For instance, retaining every nth data point while discarding the rest can significantly reduce dataset size.

  • Reservoir Sampling: This is a randomized algorithm for selecting a representative sample from a dataset or stream whose size is not known in advance. It guarantees that each possible subset of the chosen size has an equal probability of being selected, making it useful when the dataset size exceeds memory capacity.
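The three methods above can be sketched in a few lines each. The function names below are illustrative, not library APIs; the reservoir sampler follows the classic "Algorithm R" formulation, and the decimation shown is plain slicing with no anti-aliasing filter (for signals, `scipy.signal.decimate` applies one).

```python
import random
import numpy as np

def average_pool(values, window):
    """Average consecutive, non-overlapping windows; a trailing partial window is dropped."""
    values = np.asarray(values, dtype=float)
    n_windows = len(values) // window
    return values[: n_windows * window].reshape(n_windows, window).mean(axis=1)

def decimate(values, n):
    """Keep every n-th point, starting with the first."""
    return values[::n]

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: uniform sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

signal = np.arange(12)
print(average_pool(signal, 4))  # [1.5 5.5 9.5]
print(decimate(signal, 4))      # [0 4 8]
print(reservoir_sample(iter(range(10_000)), k=5, seed=0))
```

The reservoir sampler holds only `k` items in memory at any time, which is why it suits data that cannot be loaded all at once.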

Example of Downsampling in Python 

Let's consider an example where we have a time series dataset containing measurements taken every second, and we want to downsample it to reduce the granularity to minutes using average pooling.

This example uses the pandas library to work with time series data. We first create a DataFrame df with one timestamp per second covering the 24 hours of '2024-01-01'. Then, we set the timestamp column as the index of the DataFrame. Finally, we use the resample() method to downsample the data to minutes and calculate the mean value for each minute. ('T' is the older alias for minutes; pandas 2.2 and later prefer 'min'.)

This process reduces the number of data points from 86,400 (one per second) to 1,440 (one per minute), making the data easier to work with while still preserving some meaningful information.
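A minimal sketch of the example described above follows. The `value` column and the random measurements are assumptions made for illustration, and `periods=86_400` is used so that exactly one day of seconds is generated.

```python
import numpy as np
import pandas as pd

# One reading per second for a single day: 86,400 timestamps.
timestamps = pd.date_range("2024-01-01", periods=86_400, freq="s")

# Hypothetical measurements; the "value" column and random data are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({"timestamp": timestamps, "value": rng.normal(size=len(timestamps))})
df = df.set_index("timestamp")

# Average pooling: one mean per minute ("min" is the minute alias).
downsampled = df.resample("min").mean()

print(len(df), "->", len(downsampled))  # 86400 -> 1440
```

Swapping `.mean()` for `.first()`, `.max()`, or `.median()` gives other per-minute summaries with the same reduction in row count.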

How to Downsample Data in Python? 

The following steps walk through downsampling an imbalanced classification dataset in Python.

Step 1 - Import the library

    import numpy as np
    from sklearn import datasets

We have imported NumPy and the datasets module from scikit-learn.

Step 2 - Setting up the Data

We load the built-in wine dataset from the datasets module and store the features in X and the target in y. This dataset is not heavily imbalanced, so we make it imbalanced on purpose to illustrate the technique: we drop the first 30 rows, which all belong to class 0, and then relabel every class other than 0 as class 1. Class 0 becomes the minority class.

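    wine = datasets.load_wine()

    X = wine.data
    y = wine.target

    # Drop the first 30 rows (all class 0) to create an imbalance
    X = X[30:, :]
    y = y[30:]

    # Keep class 0 as 0 and merge all other classes into 1
    y = np.where((y == 0), 0, 1)
    print("Viewing the imbalanced target vector:\n", y)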

Step 3 - Downsampling the dataset

First, we collect the row indices where the target value is 0 and where it is 1 into two separate arrays, and then print the number of observations in each.

    # Row indices of each class
    w_class0 = np.where(y == 0)[0]
    w_class1 = np.where(y == 1)[0]

    # Number of observations in each class
    n_class0 = len(w_class0)
    n_class1 = len(w_class1)

    print("n_class0: ", n_class0)
    print("n_class1: ", n_class1)

The output shows that class 1 has far more samples (119) than class 0 (29). To downsample, we randomly select, without replacement, as many class 1 rows as there are class 0 rows. We then print the combined target vector of the balanced dataset.

    # Randomly pick n_class0 indices from class 1, without replacement
    w_class1_downsampled = np.random.choice(w_class1, size=n_class0, replace=False)
    print()
    print(np.hstack((y[w_class0], y[w_class1_downsampled])))

The output is:

Viewing the imbalanced target vector:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

n_class0:  29
n_class1:  119

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
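The steps above print only the balanced target vector. To train a model, the same indices would typically be applied to the feature matrix as well. The following is a minimal, self-contained sketch of that; the names `X_balanced` and `y_balanced` and the fixed random seed are assumptions added here for reproducibility, not part of the original steps.

```python
import numpy as np
from sklearn import datasets

# Recreate the imbalanced data from Step 2.
wine = datasets.load_wine()
X, y = wine.data[30:, :], wine.target[30:]
y = np.where(y == 0, 0, 1)

idx_class0 = np.where(y == 0)[0]
idx_class1 = np.where(y == 1)[0]

# Seeded generator (an assumption) so the same rows are chosen on every run.
rng = np.random.default_rng(0)
idx_class1_down = rng.choice(idx_class1, size=len(idx_class0), replace=False)

# Apply the same row selection to features and targets.
keep = np.concatenate([idx_class0, idx_class1_down])
X_balanced, y_balanced = X[keep], y[keep]

print(X_balanced.shape, np.bincount(y_balanced))  # (58, 13) [29 29]
```

Indexing X and y with the same `keep` array keeps each feature row paired with its label.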

Downsampling in Python - Considerations and Challenges

While downsampling offers computational advantages, it necessitates careful handling. Improper downsampling techniques can lead to loss of critical information, potentially skewing analytical outcomes or model predictions. It's essential to balance data reduction and preservation of essential features.

Moreover, downsampling should be distinguished from data compression. Compression techniques such as encoding and quantization aim to reduce storage requirements while keeping the full dataset recoverable, exactly or approximately. Downsampling, by contrast, deliberately discards data points and keeps only a subset or summary that retains the relevant information.

Learn Downsampling in Python with ProjectPro! 

Downsampling data in Python is a valuable skill for any data scientist or analyst who needs to handle large datasets efficiently while preserving important information. This guide explored various downsampling techniques and their implementation using Python libraries such as Pandas and Scikit-learn. Whether you're working with time-series data, images, or any other form of data, understanding downsampling methods helps you streamline your analysis, improve computational efficiency, and derive meaningful insights. However, theoretical knowledge alone is insufficient. Practical experience through real-world projects is essential for solidifying understanding and building expertise. With ProjectPro's repository of more than 270 projects spanning data science and big data domains, aspiring professionals can apply their knowledge in practical scenarios.
