Explain ReduceByKey and GroupByKey in Apache Spark

This recipe helps you understand how ReduceByKey and GroupByKey work in Apache Spark and the difference between them

Recipe Objective - What is the difference between ReduceByKey and GroupByKey in Apache Spark?

The ReduceByKey function in Apache Spark is a frequently used transformation that performs data aggregation. It receives key-value pairs as its input, merges the values for each key using a supplied function, and produces a dataset of (K, V) key-value pairs as its output. The GroupByKey function in Apache Spark is a frequently used transformation that shuffles the data. It receives key-value pairs (K, V) as its input, groups the values by key, and produces a dataset of (K, Iterable) pairs as its output.
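The two output shapes can be sketched with plain Scala collections standing in for an RDD (names and data here are illustrative, not from the recipe):

```scala
// Plain-Scala sketch of the two transformations' semantics:
// a Seq of pairs stands in for an RDD of (K, V) elements.
object KeySemantics extends App {
  val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

  // groupByKey: (K, V) pairs -> (K, Iterable[V]); values are kept as-is
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // reduceByKey(_ + _): (K, V) pairs -> (K, V); values merged per key
  val reduced: Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.reduce(_ + _)) }

  println(grouped) // grouped("a") == Seq(1, 3)
  println(reduced) // reduced("a") == 4, reduced("b") == 2
}
```

Note that reduceByKey returns one value per key, while groupByKey returns the whole collection of values per key.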

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains what ReduceByKey and GroupByKey are and what the difference is between the two.

Difference between ReduceByKey and GroupByKey.

When ReduceByKey is applied to a dataset of key-value or (K, V) pairs, the pairs on the same machine that share a key are combined locally before the data is shuffled. When GroupByKey is applied to a dataset of key-value pairs, all the data is shuffled across partitions according to the key K to form a new resilient distributed dataset (RDD) in Apache Spark.
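The effect of that local, pre-shuffle combining can be sketched with plain Scala, where two Seqs play the role of two partitions (the data and names are illustrative):

```scala
// Sketch of why reduceByKey ships less data over the network:
// values are combined within each partition before the shuffle.
object MapSideCombine extends App {
  val partition1 = Seq(("a", 1), ("a", 1), ("b", 1))
  val partition2 = Seq(("a", 1), ("b", 1), ("b", 1))

  // Combine values per key within one "partition"
  def localCombine(p: Seq[(String, Int)]): Map[String, Int] =
    p.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // reduceByKey-style: combine locally first, then merge across partitions
  val shuffledRecords = localCombine(partition1).toSeq ++ localCombine(partition2).toSeq
  val merged = shuffledRecords.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // groupByKey would move all 6 raw records across the network;
  // the locally combined form moves only 4.
  println(shuffledRecords.size) // 4
  println(merged)               // merged("a") == 3, merged("b") == 3
}
```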

// Importing the package
import org.apache.spark.sql.SparkSession

The Spark SQL SparkSession package is imported into the environment to run the reduceByKey function.

// Defining the RDD from list
val dataRDD = Seq(("Assignment", 1),
("Ruderford", 1),
("Manik", 1),
("Travelling", 1),
("out", 1),
("Wonderland", 1),
("Assignment", 1),
("Ruderford", 1),
("Travelling", 1),
("in", 1),
("Wonderland", 1),
("Assignment", 1),
("Ruderford", 1))
// Parallelizing Data
val rdd1 = spark.sparkContext.parallelize(dataRDD)
// Using Reducebykey() function
val rdd2 = rdd1.reduceByKey(_ + _)
// Output the result
rdd2.collect.foreach(println)


The resilient distributed dataset (RDD) is defined from the list and parallelized with the SparkContext. The reduceByKey() function then merges the values for each key, and the result is printed.

In the ReduceByKey implementation, unnecessary data transfer over the network is avoided because values are combined locally before the shuffle; the shuffle happens in a controlled way. In the GroupByKey implementation, a lot of unnecessary data is transferred over the network.

The ReduceByKey function works only on resilient distributed datasets (RDDs) whose elements are key-value pairs, i.e., RDDs of tuples or Map-like elements. The GroupByKey function groups the dataset by key and triggers a data shuffle when the RDD is not already partitioned by that key.

// Importing the package
import org.apache.spark.sql.SparkSession


The Spark SQL SparkSession package is imported into the environment to run the groupByKey function.

// Defining the RDD from list
val dataRDD = sc.parallelize(Array(("India", 1), ("India", 2), ("USA", 1),
  ("Indian Ocean", 1), ("USA", 4), ("USA", 9),
  ("India", 8), ("India", 3), ("USA", 4),
  ("Indian Ocean", 6), ("Indian Ocean", 9), ("Indian Ocean", 5)), 3)
// Implementing groupByKey() function
val rdd1 = dataRDD.groupByKey.collect()
// Output the result
rdd1.foreach(println)


Further, the resilient distributed dataset (RDD) is created by parallelizing the list across three partitions. The groupByKey() function then gathers the values for each key into an Iterable, and the result is printed.

The ReduceByKey function uses an implicit combiner for its tasks. The GroupByKey function does not use a combiner.

The ReduceByKey function takes a single associative binary function, which Spark applies both within each partition and again when merging the partial results across partitions. The GroupByKey function takes no aggregation function at all, so it is generally followed by a map or flatMap that processes each group of values.
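The "groupByKey followed by a map" pattern can be sketched with plain Scala collections, showing that it computes the same per-key result as a single reduce step (data and names are illustrative):

```scala
// groupByKey carries no aggregation function, so it is typically
// paired with a map over each group's values; the combination is
// semantically equivalent to a single reduceByKey-style pass.
object GroupThenMap extends App {
  val pairs = Seq(("India", 1), ("USA", 4), ("India", 8))

  // groupByKey + map: group first, then sum each group's values
  val viaGroup = pairs.groupBy(_._1)
    .map { case (k, vs) => (k, vs.map(_._2).sum) }

  // reduceByKey-style: merge values per key with one binary function
  val viaReduce = pairs.groupBy(_._1)
    .map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

  assert(viaGroup == viaReduce) // both give India -> 9, USA -> 4
  println(viaGroup)
}
```

The difference in Spark is cost, not result: reduceByKey combines before the shuffle, while groupByKey ships every value and aggregates only afterwards.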

