Explain collect_set() and collect_list() aggregate functions in PySpark

The following tutorial explains how PySpark's collect_set() and collect_list() aggregate functions work in Databricks, and thoroughly covers the difference between the two functions.

Recipe Objective - Explain collect_set() and collect_list() aggregate functions in PySpark in Databricks

The aggregate functions in Apache PySpark accept input as a Column type or a column name as a string, take further arguments depending on the function, and return a Column type. An aggregate function operates on a group of rows and calculates a single return value for every group. The PySpark SQL aggregate functions are grouped under “agg_funcs” in PySpark. The collect_set() function returns all values from the input column with the duplicate values eliminated; the collect_list() function returns all values from the input column, including the duplicates.
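As a quick illustration of that difference, here is a minimal sketch using a hypothetical single-column DataFrame (not part of the recipe below); collect_set() drops the repeated value while collect_list() keeps it, and the element order inside the collected arrays is not guaranteed.

# Minimal sketch: collect_set() vs collect_list() on a column with duplicates
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list

spark = SparkSession.builder.appName('collect sketch').getOrCreate()
sketch_df = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["num"])
sketch_df.select(collect_set("num"), collect_list("num")).show(truncate=False)
# collect_set("num")  -> e.g. [1, 2, 3]     (duplicates removed, order not guaranteed)
# collect_list("num") -> e.g. [1, 2, 2, 3]  (duplicates kept)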

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the collect_set() and collect_list() functions are and how to perform them in PySpark.

Implementing the collect_set() and collect_list() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list

The SparkSession, collect_set and collect_list packages are imported into the environment to perform the collect_set() and collect_list() functions in PySpark.


# Implementing the collect_set() and collect_list() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark collect_set() and collect_list()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
               ("Prateek", "Finance", 7600),
               ("Ram", "Sales", 5100),
               ("Reetu", "Marketing", 4000),
               ("Himesh", "Sales", 2000),
               ("Shyam", "Finance", 3500),
               ("Harsh", "Finance", 4900),
               ("Ramesh", "Marketing", 4000),
               ("Raina", "Marketing", 3000),
               ("Ankit", "Sales", 5100)]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data=Sample_Data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using collect_set() function
dataframe.select(collect_set("salary")).show(truncate=False)
# Using collect_list() function
dataframe.select(collect_list("salary")).show(truncate=False)

The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The collect_set() function returns all values from the present salary column with the duplicate values eliminated. The last() function returns all the current salary column values with the duplicates.
