Explain different ways of groupBy() in spark SQL


Recipe Objective: Explain different ways of groupBy() in spark SQL

In this recipe, we will learn about the different ways of using groupBy() in detail.

Similar to the SQL "GROUP BY" clause, the Spark SQL groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions such as count(), min(), max(), avg(), and mean() on the grouped data.
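Because groupBy() mirrors the SQL GROUP BY clause, the same aggregation can also be written in raw SQL by registering the DataFrame as a temporary view. A minimal sketch, assuming a SparkSession named spark and the df DataFrame created later in this recipe (the view name "employees" is illustrative):

```scala
// Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("employees")

// Equivalent of df.groupBy("department").count()
spark.sql("""
  SELECT department, COUNT(*) AS count
  FROM employees
  GROUP BY department
""").show()
```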


Implementation Info:

  1. Databricks Community Edition
  2. Spark - Scala
  3. Storage - Databricks File System (DBFS)

This recipe covers the following steps:

  1. Create a test DataFrame
  2. Aggregate functions using groupBy()
  3. groupBy on multiple columns
  4. Using multiple aggregate functions with groupBy using agg()
  5. Using filter on aggregate data

1. Create a test DataFrame

Here, we are creating a test DataFrame containing the columns "employee_name", "department", "state", "salary", "age", and "bonus". The toDF() function converts the raw Seq data to a DataFrame.

import spark.implicits._

println("creation of sample Test DataFrame")
val simpleData = Seq(
  ("john", "Sales", "AP", 90000, 34, 10000),
  ("mathew", "Sales", "AP", 86000, 56, 20000),
  ("Robert", "Sales", "KA", 81000, 30, 23000),
  ("Maria", "Finance", "KA", 90000, 24, 23000),
  ("krishna", "Finance", "KA", 99000, 40, 24000),
  ("shanthi", "Finance", "TL", 83000, 36, 19000),
  ("Jenny", "Finance", "TL", 79000, 53, 15000),
  ("Jaffa", "Marketing", "AP", 80000, 25, 18000),
  ("Kumar", "Marketing", "TL", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
df.printSchema()
df.show(false)


2. Aggregate functions using groupBy()

Here, we group the DataFrame by "department" and apply several aggregate functions, as shown below.

println("Aggregate functions using groupBy")
df.groupBy("department").count().show()
df.groupBy("department").min("salary").show()
df.groupBy("department").max("salary").show()
df.groupBy("department").avg("salary").show()
df.groupBy("department").mean("salary").show()


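Note that groupBy() itself returns a RelationalGroupedDataset, not a DataFrame; it only becomes a DataFrame once an aggregate such as count() or min() is applied. groupBy() also accepts Column expressions, which is useful for grouping by a derived value. A brief sketch (the age bucketing here is illustrative and not part of the recipe):

```scala
import org.apache.spark.sql.functions._

// Group by a derived boolean column: employees aged 40 or older vs under 40.
df.groupBy((col("age") >= 40).as("age_40_plus"))
  .count()
  .show()
```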

3. groupBy() on multiple columns

Here, we group by the "department" and "state" fields and compute the sum of "salary" and "bonus" for each department-state combination.

//groupBy on multiple DataFrame columns
println("groupBy on multiple columns")
df.groupBy("department", "state")
  .sum("salary", "bonus")
  .show(false)


4. Using multiple aggregate functions with groupBy using agg()

Here, we group by the "department" field and use Spark's agg() function to apply multiple aggregate functions at once: the sum and average of salary, and the sum and max of bonus.

println("using multiple aggregate functions with groupBy using agg()")
import org.apache.spark.sql.functions._
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .show(false)

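As an alternative form, agg() also accepts a Map from column name to aggregate function name, which can be handy when the aggregations are chosen dynamically; the trade-off is that you cannot alias the result columns inline. A minimal sketch:

```scala
// Map-based agg: keys are column names, values are aggregate function names.
// Result columns are auto-named, e.g. "avg(salary)" and "max(bonus)".
df.groupBy("department")
  .agg(Map("salary" -> "avg", "bonus" -> "max"))
  .show(false)
```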

5. Using filter on aggregate data

Here, we again group by the "department" field and use Spark's agg() function to compute the sum, average, and max of bonus and salary. The where() clause then keeps only those departments whose total bonus is greater than or equal to 50000.

println("Using filter on aggregate data")
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .where(col("sum_bonus") >= 50000)
  .show(false)

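A where() applied to an aggregated column plays the same role as SQL's HAVING clause. For comparison, the same filter can be expressed in raw SQL; this is a sketch, assuming df is registered under the illustrative view name "employees":

```scala
df.createOrReplaceTempView("employees")

// HAVING filters groups after aggregation, like where() on an aggregated column.
spark.sql("""
  SELECT department,
         SUM(salary) AS sum_salary,
         AVG(salary) AS avg_salary,
         SUM(bonus)  AS sum_bonus,
         MAX(bonus)  AS max_bonus
  FROM employees
  GROUP BY department
  HAVING SUM(bonus) >= 50000
""").show(false)
```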

Conclusion

In this recipe, you have learned how to use groupBy() and aggregate functions on a Spark DataFrame, how to group on multiple columns, and how to filter data on an aggregated column.
