Explain different ways of groupBy() in spark SQL


Recipe Objective: Explain different ways of groupBy() in spark SQL

In this recipe, we will learn about the different ways of using groupBy() in detail.

Similar to the SQL "GROUP BY" clause, the Spark SQL groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions such as count(), min(), max(), avg(), and mean() on the grouped data.
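Because groupBy() mirrors the SQL GROUP BY clause, the same aggregation can also be written in raw SQL by registering the DataFrame as a temporary view. A minimal sketch, assuming a SparkSession named spark and the df DataFrame created later in this recipe (the view name "employees" is illustrative):

```scala
// Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("employees")

// Equivalent of df.groupBy("department").count()
spark.sql("""
  SELECT department, COUNT(*) AS count
  FROM employees
  GROUP BY department
""").show()
```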


Implementation Info:

  1. Databricks Community Edition
  2. Spark - Scala
  3. Storage - Databricks File System (DBFS)

This recipe covers the following steps:

  1. Create a test DataFrame
  2. Aggregate functions using groupBy()
  3. groupBy on multiple columns
  4. Using multiple aggregate functions with groupBy using agg()
  5. Using filter on aggregate data

1. Create a test DataFrame

Here, we are creating a test DataFrame containing the columns "employee_name", "department", "state", "salary", "age", and "bonus". The toDF() function converts the raw Seq data to a DataFrame.

import spark.implicits._

println("creation of sample Test DataFrame")
val simpleData = Seq(
  ("john", "Sales", "AP", 90000, 34, 10000),
  ("mathew", "Sales", "AP", 86000, 56, 20000),
  ("Robert", "Sales", "KA", 81000, 30, 23000),
  ("Maria", "Finance", "KA", 90000, 24, 23000),
  ("krishna", "Finance", "KA", 99000, 40, 24000),
  ("shanthi", "Finance", "TL", 83000, 36, 19000),
  ("Jenny", "Finance", "TL", 79000, 53, 15000),
  ("Jaffa", "Marketing", "AP", 80000, 25, 18000),
  ("Kumar", "Marketing", "TL", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
df.printSchema()
df.show(false)


2. Aggregate functions using groupBy()

Here, we group the DataFrame by "department" and apply several aggregate functions, as shown below.

println("Aggregate functions using groupBy")
df.groupBy("department").count().show()
df.groupBy("department").min("salary").show()
df.groupBy("department").max("salary").show()
df.groupBy("department").avg("salary").show()
df.groupBy("department").mean("salary").show()


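Note that groupBy() itself returns a RelationalGroupedDataset, not a DataFrame; it only becomes a DataFrame once an aggregate such as count() or min() is applied. groupBy() also accepts Column expressions, which is useful for grouping by a derived value. A brief sketch (the age bucketing here is illustrative and not part of the recipe):

```scala
import org.apache.spark.sql.functions._

// Group by a derived boolean column: employees aged 40 or older vs under 40.
df.groupBy((col("age") >= 40).as("age_40_plus"))
  .count()
  .show()
```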

3. groupBy() on multiple columns

Here, we group by the "department" and "state" fields and compute the sum of "salary" and "bonus" for each department-state combination.

//groupBy on multiple DataFrame columns
println("groupBy on multiple columns")
df.groupBy("department", "state")
  .sum("salary", "bonus")
  .show(false)


4. Using multiple aggregate functions with groupBy using agg()

Here, we group by the "department" field and use Spark's agg() function to apply multiple aggregate functions at once: the sum and average of salary, and the sum and max of bonus.

println("using multiple aggregate functions with groupBy using agg()")
import org.apache.spark.sql.functions._
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .show(false)

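As an alternative form, agg() also accepts a Map from column name to aggregate function name, which can be handy when the aggregations are chosen dynamically; the trade-off is that you cannot alias the result columns inline. A minimal sketch:

```scala
// Map-based agg: keys are column names, values are aggregate function names.
// Result columns are auto-named, e.g. "avg(salary)" and "max(bonus)".
df.groupBy("department")
  .agg(Map("salary" -> "avg", "bonus" -> "max"))
  .show(false)
```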

5. Using filter on aggregate data

Here, we again group by the "department" field and use Spark's agg() function to compute the sum, average, and max of bonus and salary. The where() clause then keeps only those departments whose total bonus is greater than or equal to 50000.

println("Using filter on aggregate data")
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .where(col("sum_bonus") >= 50000)
  .show(false)

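A where() applied to an aggregated column plays the same role as SQL's HAVING clause. For comparison, the same filter can be expressed in raw SQL; this is a sketch, assuming df is registered under the illustrative view name "employees":

```scala
df.createOrReplaceTempView("employees")

// HAVING filters groups after aggregation, like where() on an aggregated column.
spark.sql("""
  SELECT department,
         SUM(salary) AS sum_salary,
         AVG(salary) AS avg_salary,
         SUM(bonus)  AS sum_bonus,
         MAX(bonus)  AS max_bonus
  FROM employees
  GROUP BY department
  HAVING SUM(bonus) >= 50000
""").show(false)
```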

Conclusion

In this recipe, you have learned how to use groupBy() and aggregate functions on a Spark DataFrame, how to group on multiple columns, and how to filter data on an aggregated column.
