Explain groupBy(), filter() and sort() functions in PySpark in Databricks

This recipe explains how the groupBy(), filter(), and sort() functions work in PySpark in Databricks and how to implement them using Python. The day-to-day use of these functions is explained thoroughly with the help of an example.

Recipe Objective - Explain groupBy(), filter() and sort() functions in PySpark in Databricks

The groupBy(), filter(), and sort() functions in Apache Spark are widely used on DataFrames and simplify many otherwise tedious day-to-day tasks. The groupBy() function groups the rows of a DataFrame by one or more columns and returns a GroupedData object, on which aggregate functions such as sum(), max(), min(), avg(), mean(), and count() can be applied. The filter() function keeps only the rows that satisfy a user-defined condition. The sort() function orders the data in the DataFrame in ascending or descending order.
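As a quick illustration of how the three functions chain together, here is a minimal sketch; it assumes a hypothetical DataFrame named df with "department" and "salary" columns, which is not part of this recipe's dataset:

# Minimal sketch (assumed DataFrame "df" with department and salary columns):
# group by department, aggregate, keep groups with more than one employee,
# and sort by average salary in descending order.
from pyspark.sql.functions import count, avg, col, desc
df.groupBy("department") \
    .agg(count("*").alias("employees"),
         avg("salary").alias("avg_salary")) \
    .filter(col("employees") > 1) \
    .sort(desc("avg_salary")) \
    .show()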

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the groupBy(), filter(), and sort() functions are and how to perform them in PySpark.

Explore PySpark Machine Learning Tutorial to take your PySpark skills to the next level!

Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col, desc

The SparkSession class and the sum, col, and desc functions are imported into the environment to demonstrate the groupBy(), filter(), and sort() functions in PySpark.
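Note that importing sum, max, or min directly from pyspark.sql.functions shadows the Python built-ins of the same name in the notebook. An optional variation, not part of the recipe itself, is to import the module under an alias and qualify each function; the last line below assumes the "dataframe" created in the next step:

# Optional variation: alias the functions module to avoid shadowing
# Python built-ins such as sum(), max() and min().
from pyspark.sql import functions as F
dataframe.groupBy("state").agg(F.sum("salary").alias("sum_salary")).show()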

# Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark
# Databricks already provides a SparkSession named "spark"; the line below
# creates one explicitly so the code also runs outside Databricks.
spark = SparkSession.builder.appName("GroupByFilterSort").getOrCreate()
Sample_Data = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NV",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","DE",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","NV",80000,25,18000),
    ("Kumar","Marketing","NJ",91000,50,21000)
]
Sample_schema = ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using groupBy(), filter() and sort() together: group by state, sum the
# salaries, keep groups whose total exceeds 100000 and sort descending.
dataframe.groupBy("state") \
    .agg(sum("salary").alias("sum_salary")) \
    .filter(col("sum_salary") > 100000) \
    .sort(desc("sum_salary")) \
    .show()
# Using sort() to order the whole dataframe by salary in descending order
dataframe.sort(desc("salary")).show()

The "dataframe" value is created in which the Sample_data and Sample_columns are defined. Using the groupBy() function, the dataframe is grouped based on the "state" column and calculates the aggregate sum of salary. The filter() function returns the "sum_salary" greater than 100000. The sort() function returns the "sum_salary."


Relevant Projects

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster
Build a fully working, scalable, reliable, and secure AWS EMR data pipeline from scratch that supports all data stages, from data collection to data analysis and visualization.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

GCP Data Ingestion with SQL using Google Cloud Dataflow
In this GCP Project, you will learn to build a data processing pipeline With Apache Beam, Dataflow & BigQuery on GCP using Yelp Dataset.

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.

COVID-19 Data Analysis Project using Python and AWS Stack
COVID-19 Data Analysis Project using Python and AWS to build an automated data pipeline that processes COVID-19 data from Johns Hopkins University and generates interactive dashboards to provide insights into the pandemic for public health officials, researchers, and the general public.

Movielens Dataset Analysis on Azure
Build a movie recommender system on Azure using Spark SQL to analyse the MovieLens dataset. Deploy Azure Data Factory and data pipelines, and visualise the analysis.

PySpark ETL Project for Real-Time Data Processing
In this PySpark ETL Project, you will learn to build a data pipeline and perform ETL operations for real-time data processing.

GCP Project-Build Pipeline using Dataflow Apache Beam Python
In this GCP Project, you will learn to build a data pipeline using Apache Beam Python on Google Dataflow.

Build an Analytical Platform for eCommerce using AWS Services
In this AWS Big Data Project, you will use an eCommerce dataset to simulate the logs of user purchases, product views, cart history, and the user’s journey to build batch and real-time pipelines.

SQL Project for Data Analysis using Oracle Database-Part 6
In this SQL project, you will learn the basics of data wrangling with SQL to perform operations on missing data, unwanted features and duplicated records.