Explain the Repartition and Coalesce functions in PySpark in Databricks

This recipe explains what the Repartition and Coalesce functions are in PySpark in Databricks.

Recipe Objective - Explain the Repartition() and Coalesce() functions in PySpark in Databricks

In PySpark, the Repartition() function is widely used to increase or decrease the number of partitions of a Resilient Distributed Dataset (RDD) or DataFrame. The Coalesce() function, also widely used, only decreases the number of partitions, and does so efficiently. The PySpark repartition() and coalesce() functions are expensive operations because they shuffle data across partitions, so their use should be minimized as much as possible.

The Resilient Distributed Dataset, or RDD, is the fundamental data structure of Apache PySpark. Developed by The Apache Software Foundation and introduced in 2011, it is an immutable, distributed collection of objects in which each dataset is divided into logical partitions that may be computed on different nodes of the cluster. The Dataset, introduced in 2015, is a strongly typed data structure in Spark SQL that maps to a relational schema. It represents structured queries with encoders, is an extension of the DataFrame API, and provides both type safety and an object-oriented programming interface.
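As a quick illustration of the difference, the following minimal sketch (assuming a local SparkSession built here for demonstration, rather than the one Databricks provides) raises a DataFrame's partition count with repartition() and then lowers it with coalesce():

# Sketch: comparing repartition() and coalesce() on a DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('RepartitionCoalesceSketch').master("local[4]").getOrCreate()
df = spark.range(0, 20)
print(df.rdd.getNumPartitions())      # default partition count (4 here, from local[4])
df_up = df.repartition(10)            # full shuffle, partitions increase to 10
print(df_up.rdd.getNumPartitions())
df_down = df_up.coalesce(2)           # merges existing partitions, minimal data movement
print(df_down.rdd.getNumPartitions())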

System Requirements

This recipe explains what the Repartition() and Coalesce() functions are and demonstrates their usage in PySpark.

Implementing the Repartition() and Coalesce() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment in order to use the Repartition() and Coalesce() functions in PySpark.

# Implementing the Repartition() and Coalesce() functions in Databricks in PySpark
spark = SparkSession.builder.appName('Repartition and Coalesce() PySpark') \
    .master("local[5]").getOrCreate()
# DataFrame with values 0 to 19
dataframe = spark.range(0, 20)
print(dataframe.rdd.getNumPartitions())
spark.conf.set("spark.sql.shuffle.partitions", "500")
# RDD distributed across the 5 partitions implied by local[5]
Rdd = spark.sparkContext.parallelize(range(0, 20))
print("From local[5] : " + str(Rdd.getNumPartitions()))
# RDD explicitly distributed across 6 partitions
Rdd1 = spark.sparkContext.parallelize(range(0, 25), 6)
print("parallelize : " + str(Rdd1.getNumPartitions()))
Rdd1.saveAsTextFile("/FileStore/tables/partition22")
# Using repartition() function
Rdd2 = Rdd1.repartition(5)
print("Repartition size : " + str(Rdd2.getNumPartitions()))
Rdd2.saveAsTextFile("/FileStore/tables/re-partition22")
# Using coalesce() function
Rdd3 = Rdd1.coalesce(5)
print("Coalesce size : " + str(Rdd3.getNumPartitions()))
Rdd3.saveAsTextFile("/FileStore/tables/coalesce22")
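
The spark.sql.shuffle.partitions setting above (500) controls how many partitions DataFrame shuffle operations such as groupBy() produce. A hedged sketch of checking this follows; the observed count can be lower on Databricks if adaptive query execution coalesces the shuffle partitions.

# Sketch: effect of spark.sql.shuffle.partitions on a DataFrame shuffle
grouped = dataframe.groupBy((dataframe.id % 2).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())   # typically 500 here, unless AQE coalesces the shuffle partitions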

The Spark Session is defined first. The DataFrame "dataframe" is created using spark.range(0, 20). The RDD "Rdd" is created with "spark.sparkContext.parallelize()", so its data is spread over the 5 partitions implied by local[5]. The RDD "Rdd1" is created using "spark.sparkContext.parallelize(range(0, 25), 6)", which distributes the data across 6 partitions. The repartition() function then reduces the partitions from 6 to 5 by redistributing the data from all partitions, which is a full shuffle and therefore a very expensive operation when dealing with billions of records. The coalesce() function also reduces the number of partitions to 5, but it is an optimized or improved version of repartition() for reducing partitions only: existing partitions are merged, so the movement of data across partitions is much lower.
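
To see the difference in data movement on this small example, the partition contents can be inspected with glom(), which collects each partition into a list (a sketch intended for small, local data only; the exact row distribution will vary with the cluster):

# Sketch: inspecting partition contents before and after repartition()/coalesce()
print(Rdd1.glom().collect())   # 6 original partitions
print(Rdd2.glom().collect())   # 5 partitions after repartition() - rows redistributed across all partitions
print(Rdd3.glom().collect())   # 5 partitions after coalesce() - existing partitions merged where possible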

