Explain groupby filter and sort functions in PySpark in Databricks

The recipe explains how the groupBy(), filter(), and sort() functions work in PySpark in Databricks and how to implement them using Python. The day-to-day use of these functions is explained thoroughly with the help of an example.

Recipe Objective - Explain groupBy(), filter() and sort() functions in PySpark in Databricks

The groupBy(), filter(), and sort() functions in Apache Spark are popularly used on dataframes for many day-to-day tasks, from simple aggregations to more involved ones. The groupBy() function groups the rows of a dataframe and returns a GroupedData object on which aggregate functions such as sum(), max(), min(), avg(), mean(), and count() can be applied. The filter() function filters the rows of a dataframe (or of a grouped and aggregated result) based on a condition defined by the user. The sort() function sorts the data in a dataframe in ascending or descending order.
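
For a quick feel of each call, here is a minimal sketch; the dataframe df and the column names department and salary are illustrative and not part of this recipe's dataset.

# Minimal sketch of each function (df, "department" and "salary" are illustrative names)
from pyspark.sql.functions import avg, col

df.groupBy("department").agg(avg("salary"))   # group rows and apply an aggregate per group
df.filter(col("salary") > 80000)              # keep only the rows matching the condition
df.sort(col("salary").desc())                 # sort rows by salary in descending order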

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the groupBy(), filter(), and sort() functions are and how to perform them in PySpark.

Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql.functions import sum, col, desc

The sum, col, and desc functions are imported into the environment to demonstrate the groupBy(), filter(), and sort() functions in PySpark. In a Databricks notebook, the SparkSession is already available as the predefined variable spark, so it does not need to be created explicitly.

# Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark
Sample_Data = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NV",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","DE",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","NV",80000,25,18000),
    ("Kumar","Marketing","NJ",91000,50,21000)
]
Sample_schema = ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using groupBy(), filter() and sort() functions
dataframe.groupBy("state") \
    .agg(sum("salary").alias("sum_salary")) \
    .filter(col("sum_salary") > 100000) \
    .sort(desc("sum_salary")) \
    .show()
# Using the sort() function to sort the dataframe by salary in descending order
dataframe.sort(desc("salary")).show()
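
The sort() call above orders the rows in descending order; by default, sort() sorts in ascending order, and orderBy() is an equivalent call. A minimal sketch reusing the dataframe defined above:

# Sort by salary in ascending order (the default), and the equivalent orderBy() call
dataframe.sort("salary").show()
dataframe.orderBy(col("salary").asc()).show()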

The "dataframe" value is created in which the Sample_data and Sample_columns are defined. Using the groupBy() function, the dataframe is grouped based on the "state" column and calculates the aggregate sum of salary. The filter() function returns the "sum_salary" greater than 100000. The sort() function returns the "sum_salary."
