Explain groupby filter and sort functions in PySpark in Databricks

The recipe explains how the groupBy(), filter(), and sort() functions work in PySpark in Databricks and how to implement them using Python. The day-to-day use of these functions is explained thoroughly with the help of an example.

Recipe Objective - Explain groupBy(), filter() and sort() functions in PySpark in Databricks

The groupBy(), filter(), and sort() functions in Apache Spark are popularly used on dataframes for many day-to-day tasks, from simple aggregations to more involved ones. The groupBy() function groups the rows of a dataframe and returns a GroupedData object on which aggregate functions such as sum(), max(), min(), avg(), mean(), and count() can be applied. The filter() function filters the rows of a dataframe (or of a grouped and aggregated result) based on a condition defined by the user. The sort() function sorts the data in a dataframe in ascending or descending order.
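
For a quick feel of each call, here is a minimal sketch; the dataframe df and the column names department and salary are illustrative and not part of this recipe's dataset.

# Minimal sketch of each function (df, "department" and "salary" are illustrative names)
from pyspark.sql.functions import avg, col

df.groupBy("department").agg(avg("salary"))   # group rows and apply an aggregate per group
df.filter(col("salary") > 80000)              # keep only the rows matching the condition
df.sort(col("salary").desc())                 # sort rows by salary in descending order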

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the groupBy(), filter(), and sort() functions are and how to perform them in PySpark.

Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql.functions import sum, col, desc

The sum, col, and desc functions are imported into the environment to demonstrate the groupBy(), filter(), and sort() functions in PySpark. In a Databricks notebook, the SparkSession is already available as the predefined variable spark, so it does not need to be created explicitly.

# Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark
Sample_Data = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NV",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","DE",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","NV",80000,25,18000),
    ("Kumar","Marketing","NJ",91000,50,21000)
]
Sample_schema = ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using groupBy(), filter() and sort() functions
dataframe.groupBy("state") \
    .agg(sum("salary").alias("sum_salary")) \
    .filter(col("sum_salary") > 100000) \
    .sort(desc("sum_salary")) \
    .show()
# Using the sort() function to sort the dataframe by salary in descending order
dataframe.sort(desc("salary")).show()
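
The sort() call above orders the rows in descending order; by default, sort() sorts in ascending order, and orderBy() is an equivalent call. A minimal sketch reusing the dataframe defined above:

# Sort by salary in ascending order (the default), and the equivalent orderBy() call
dataframe.sort("salary").show()
dataframe.orderBy(col("salary").asc()).show()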

The "dataframe" value is created in which the Sample_data and Sample_columns are defined. Using the groupBy() function, the dataframe is grouped based on the "state" column and calculates the aggregate sum of salary. The filter() function returns the "sum_salary" greater than 100000. The sort() function returns the "sum_salary."
