Explain the Repartition and Coalesce functions in PySpark in Databricks

This recipe explains what the Repartition and Coalesce functions do in PySpark in Databricks.

Recipe Objective - Explain the Repartition() and Coalesce() functions in PySpark in Databricks

In PySpark, the repartition() function is widely used and is defined to increase or decrease the number of partitions of a Resilient Distributed Dataset (RDD) or DataFrame. The coalesce() function is widely used and is defined to only decrease the number of partitions, which it does efficiently. The repartition() function is an expensive operation because it shuffles the data across all partitions; coalesce() avoids a full shuffle but still moves data when merging partitions, so both should be used only when the number of partitions genuinely needs to change.

The Resilient Distributed Dataset, or RDD, is the fundamental data structure of Apache PySpark. Developed by The Apache Software Foundation, an RDD is an immutable distributed collection of objects in which each dataset is divided into logical partitions that may be computed on different nodes of the cluster. The RDD concept was launched in the year 2011. The Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API. A Spark Dataset provides both type safety and an object-oriented programming interface. The Dataset concept was launched in the year 2015.
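As a quick DataFrame-level illustration (the app name, row count, and partition counts below are arbitrary choices for this sketch, not part of the recipe), repartition() can both increase and decrease the partition count, while coalesce() can only decrease it:

# Minimal sketch of repartition() vs coalesce() on a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('RepartitionCoalesceSketch') \
    .master("local[4]").getOrCreate()

df = spark.range(0, 100)                # small DataFrame for the example
print(df.rdd.getNumPartitions())        # default partitions (4 with local[4])

df_up = df.repartition(8)               # full shuffle: can increase or decrease
print(df_up.rdd.getNumPartitions())     # 8

df_down = df.coalesce(2)                # merges existing partitions: decrease only
print(df_down.rdd.getNumPartitions())   # 2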

System Requirements

This recipe explains what the Repartition() and Coalesce() functions are and demonstrates their usage in PySpark.

Implementing the Repartition() and Coalesce() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use the repartition() and coalesce() functions in PySpark.

# Implementing the Repartition() and Coalesce() functions in Databricks in PySpark
spark = SparkSession.builder.appName('Repartition and Coalesce() PySpark') \
    .master("local[5]").getOrCreate()
# DataFrame of 20 rows; with local[5], it is split across 5 partitions by default
dataframe = spark.range(0, 20)
print(dataframe.rdd.getNumPartitions())
# Number of partitions produced by DataFrame shuffles (joins, aggregations)
spark.conf.set("spark.sql.shuffle.partitions", "500")
Rdd = spark.sparkContext.parallelize(range(0, 20))
print("From local[5] : " + str(Rdd.getNumPartitions()))
# Explicitly distribute the data across 6 partitions
Rdd1 = spark.sparkContext.parallelize(range(0, 25), 6)
print("parallelize : " + str(Rdd1.getNumPartitions()))
Rdd1.saveAsTextFile("/FileStore/tables/partition22")
# Using repartition() function - full shuffle down to 5 partitions
Rdd2 = Rdd1.repartition(5)
print("Repartition size : " + str(Rdd2.getNumPartitions()))
Rdd2.saveAsTextFile("/FileStore/tables/re-partition22")
# Using coalesce() function - merges partitions down to 5 without a full shuffle
Rdd3 = Rdd1.coalesce(5)
print("Coalesce size : " + str(Rdd3.getNumPartitions()))
Rdd3.saveAsTextFile("/FileStore/tables/coalesce22")
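Note that the "spark.sql.shuffle.partitions" setting in the code above does not affect these RDD operations; it only controls how many partitions a DataFrame shuffle, such as a groupBy or join, produces. A minimal sketch of checking this (the bucketing expression is arbitrary, and with adaptive query execution enabled Databricks may coalesce the shuffle partitions to a smaller number):

# Checking the effect of spark.sql.shuffle.partitions on a DataFrame shuffle
grouped = dataframe.groupBy((dataframe.id % 3).alias("bucket")).count()
# Without adaptive query execution, this typically prints 500 (the value set above)
print("Shuffle partitions : " + str(grouped.rdd.getNumPartitions()))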

The SparkSession is defined, and the DataFrame "dataframe" is created using a Spark range of 0 to 20. The "Rdd" is created with "spark.sparkContext.parallelize()", which distributes the data across the default number of partitions (five here, because of local[5]). The "Rdd1" is created with "spark.sparkContext.parallelize(range(0,25), 6)", which explicitly distributes the data across six partitions. The repartition() function then reduces the partitions from six to five by re-distributing the data from all partitions; this is a full shuffle, which makes repartition() a very expensive operation when dealing with billions of records. The coalesce() function is used only to reduce the number of partitions and is an optimized version of repartition() for that case, because it merges existing partitions and moves far less data across them.
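For DataFrames, repartition() also accepts one or more columns, so that rows with the same key land in the same partition, which is often useful before a write. A minimal sketch, where the "bucket" column and the output path are hypothetical and used only for illustration:

# Minimal sketch: hash-partitioning a DataFrame by a column before writing
from pyspark.sql import functions as F

df = spark.range(0, 50).withColumn("bucket", F.col("id") % 5)
by_col = df.repartition(5, "bucket")    # 5 partitions, rows hashed on "bucket"
print("Repartition by column : " + str(by_col.rdd.getNumPartitions()))
by_col.write.mode("overwrite").csv("/FileStore/tables/repartition-by-column")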

Relevant Projects

Flask API Big Data Project using Databricks and Unity Catalog
In this Flask Project, you will use Flask APIs, Databricks, and Unity Catalog to build a secure data processing platform focusing on climate data. You will also explore advanced features like Docker containerization, data encryption, and detailed data lineage tracking.

Streaming Data Pipeline using Spark, HBase and Phoenix
Build a Real-Time Streaming Data Pipeline for an application that monitors oil wells using Apache Spark, HBase, and Apache Phoenix.

Databricks Data Lineage and Replication Management
Databricks Project on data lineage and replication management to help you optimize your data management practices | ProjectPro

Airline Dataset Analysis using PySpark GraphFrames in Python
In this PySpark project, you will perform airline dataset analysis using graphframes in Python to find structural motifs, the shortest route between cities, and rank airports with PageRank.

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster
Build a fully working scalable, reliable and secure AWS EMR complex data pipeline from scratch that provides support for all data stages from data collection to data analysis and visualization.

How to deal with slowly changing dimensions using Snowflake?
Implement Slowly Changing Dimensions using Snowflake Method - Build Type 1 and Type 2 SCD in Snowflake using the Stream and Task Functionalities

Learn Data Processing with Spark SQL using Scala on AWS
In this AWS Spark SQL project, you will analyze the Movies and Ratings Dataset using RDD and Spark SQL to get hands-on experience on the fundamentals of Scala programming language.

AWS Project-Website Monitoring using AWS Lambda and Aurora
In this AWS Project, you will learn the best practices for website monitoring using AWS services like Lambda, Aurora MySQL, Amazon Dynamo DB and Kinesis.

GCP Project to Learn using BigQuery for Exploring Data
Learn using GCP BigQuery for exploring and preparing data for analysis and transformation of your datasets.

Snowflake Azure Project to build real-time Twitter feed dashboard
In this Snowflake Azure project, you will ingest generated Twitter feeds to Snowflake in near real-time to power an in-built dashboard utility for obtaining popularity feeds reports.