Explain collect_set() and collect_list() aggregate functions in PySpark

The following tutorial explains how PySpark's collect_set() and collect_list() aggregate functions work in Databricks, and thoroughly covers the difference between the two functions.

Recipe Objective - Explain collect_set() and collect_list() aggregate functions in PySpark in Databricks

The aggregate functions in Apache PySpark accept input as a Column type or a column name as a string, take further arguments depending on the function, and return a Column type. An aggregate function operates on a group of rows and calculates a single return value for every group. The PySpark SQL aggregate functions are grouped under “agg_funcs” in PySpark. The collect_set() function returns all values from the input column with the duplicate values eliminated; the collect_list() function returns all values from the input column, including the duplicates.
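As a quick illustration of that difference, here is a minimal sketch using a hypothetical single-column DataFrame (not part of the recipe below); collect_set() drops the repeated value while collect_list() keeps it, and the element order inside the collected arrays is not guaranteed.

# Minimal sketch: collect_set() vs collect_list() on a column with duplicates
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list

spark = SparkSession.builder.appName('collect sketch').getOrCreate()
sketch_df = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["num"])
sketch_df.select(collect_set("num"), collect_list("num")).show(truncate=False)
# collect_set("num")  -> e.g. [1, 2, 3]     (duplicates removed, order not guaranteed)
# collect_list("num") -> e.g. [1, 2, 2, 3]  (duplicates kept)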

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the collect_set() and collect_list() functions are and how to perform them in PySpark.

Implementing the collect_set() and collect_list() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list

The SparkSession, collect_set and collect_list packages are imported into the environment to perform the collect_set() and collect_list() functions in PySpark.


# Implementing the collect_set() and collect_list() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark collect_set() and collect_list()').getOrCreate()
Sample_Data = [("Rahul", "Technology", 8000),
               ("Prateek", "Finance", 7600),
               ("Ram", "Sales", 5100),
               ("Reetu", "Marketing", 4000),
               ("Himesh", "Sales", 2000),
               ("Shyam", "Finance", 3500),
               ("Harsh", "Finance", 4900),
               ("Ramesh", "Marketing", 4000),
               ("Raina", "Marketing", 3000),
               ("Ankit", "Sales", 5100)]
Sample_schema = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data=Sample_Data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using collect_set() function
dataframe.select(collect_set("salary")).show(truncate=False)
# Using collect_list() function
dataframe.select(collect_list("salary")).show(truncate=False)

The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The collect_set() function returns all values from the present salary column with the duplicate values eliminated. The last() function returns all the current salary column values with the duplicates.
