Explain groupBy(), filter() and sort() functions in PySpark in Databricks

This recipe explains how the groupBy(), filter(), and sort() functions work in PySpark in Databricks and how to implement them using Python. The day-to-day use of these functions is explained thoroughly with the help of an example.

Recipe Objective - Explain groupBy(), filter() and sort() functions in PySpark in Databricks

The groupBy(), filter(), and sort() functions in Apache Spark are widely used on DataFrames and simplify many otherwise tedious day-to-day tasks. The groupBy() function groups the rows of a DataFrame by one or more columns and returns a GroupedData object, on which aggregate functions such as sum(), max(), min(), avg(), mean(), and count() can be applied. The filter() function keeps only the rows that satisfy a user-defined condition. The sort() function orders the data in the DataFrame in ascending or descending order.
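As a quick illustration of how the three functions chain together, here is a minimal sketch; it assumes a hypothetical DataFrame named df with "department" and "salary" columns, which is not part of this recipe's dataset:

# Minimal sketch (assumed DataFrame "df" with department and salary columns):
# group by department, aggregate, keep groups with more than one employee,
# and sort by average salary in descending order.
from pyspark.sql.functions import count, avg, col, desc
df.groupBy("department") \
    .agg(count("*").alias("employees"),
         avg("salary").alias("avg_salary")) \
    .filter(col("employees") > 1) \
    .sort(desc("avg_salary")) \
    .show()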

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the groupBy(), filter(), and sort() functions are and how to perform them in PySpark.

Explore PySpark Machine Learning Tutorial to take your PySpark skills to the next level!

Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col, desc

The SparkSession class and the sum, col, and desc functions are imported into the environment to demonstrate the groupBy(), filter(), and sort() functions in PySpark.
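Note that importing sum, max, or min directly from pyspark.sql.functions shadows the Python built-ins of the same name in the notebook. An optional variation, not part of the recipe itself, is to import the module under an alias and qualify each function; the last line below assumes the "dataframe" created in the next step:

# Optional variation: alias the functions module to avoid shadowing
# Python built-ins such as sum(), max() and min().
from pyspark.sql import functions as F
dataframe.groupBy("state").agg(F.sum("salary").alias("sum_salary")).show()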

# Implementing the groupBy(), filter() and sort() functions in Databricks in PySpark
# Databricks already provides a SparkSession named "spark"; the line below
# creates one explicitly so the code also runs outside Databricks.
spark = SparkSession.builder.appName("GroupByFilterSort").getOrCreate()
Sample_Data = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NV",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","DE",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","NV",80000,25,18000),
    ("Kumar","Marketing","NJ",91000,50,21000)
]
Sample_schema = ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = Sample_Data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using groupBy(), filter() and sort() together: group by state, sum the
# salaries, keep groups whose total exceeds 100000 and sort descending.
dataframe.groupBy("state") \
    .agg(sum("salary").alias("sum_salary")) \
    .filter(col("sum_salary") > 100000) \
    .sort(desc("sum_salary")) \
    .show()
# Using sort() to order the whole dataframe by salary in descending order
dataframe.sort(desc("salary")).show()

The "dataframe" value is created in which the Sample_data and Sample_columns are defined. Using the groupBy() function, the dataframe is grouped based on the "state" column and calculates the aggregate sum of salary. The filter() function returns the "sum_salary" greater than 100000. The sort() function returns the "sum_salary."


Relevant Projects

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster
Build a fully working, scalable, reliable, and secure AWS EMR data pipeline from scratch that supports all data stages, from data collection to data analysis and visualization.

Python and MongoDB Project for Beginners with Source Code-Part 2
In this Python and MongoDB Project for Beginners, you will learn how to use Apache Sedona and perform advanced analysis on the Transportation dataset.

GCP Data Ingestion with SQL using Google Cloud Dataflow
In this GCP Project, you will learn to build a data processing pipeline With Apache Beam, Dataflow & BigQuery on GCP using Yelp Dataset.

Migration of MySQL Databases to Cloud AWS using AWS DMS
IoT-based Data Migration Project using AWS DMS and Aurora Postgres aims to migrate real-time IoT-based data from a MySQL database to the AWS cloud.

COVID-19 Data Analysis Project using Python and AWS Stack
COVID-19 Data Analysis Project using Python and AWS to build an automated data pipeline that processes COVID-19 data from Johns Hopkins University and generates interactive dashboards to provide insights into the pandemic for public health officials, researchers, and the general public.

Movielens Dataset Analysis on Azure
Build a movie recommender system on Azure using Spark SQL to analyse the MovieLens dataset. Deploy Azure Data Factory and data pipelines, and visualise the analysis.

PySpark ETL Project for Real-Time Data Processing
In this PySpark ETL Project, you will learn to build a data pipeline and perform ETL operations for real-time data processing.

GCP Project-Build Pipeline using Dataflow Apache Beam Python
In this GCP Project, you will learn to build a data pipeline using Apache Beam Python on Google Dataflow.

Build an Analytical Platform for eCommerce using AWS Services
In this AWS Big Data Project, you will use an eCommerce dataset to simulate the logs of user purchases, product views, cart history, and the user’s journey to build batch and real-time pipelines.

SQL Project for Data Analysis using Oracle Database-Part 6
In this SQL project, you will learn the basics of data wrangling with SQL to perform operations on missing data, unwanted features and duplicated records.