Explain the Broadcast variables in PySpark

This recipe explains what broadcast variables are in PySpark

Recipe Objective - Explain the broadcast variables in PySpark

In PySpark Resilient Distributed Datasets (RDDs) and DataFrames, broadcast variables are read-only shared variables that are cached and made available on every node in the cluster so that tasks can access them. Instead of shipping this data along with every task, PySpark distributes broadcast variables to the workers using efficient broadcast algorithms, which reduces communication costs. Broadcast variables are created for data that is shared across multiple stages and tasks. They are not sent to the executors at the "sc.broadcast(variable)" call; instead, they are sent to the executors when they are first used. A PySpark broadcast variable is created with the "broadcast(v)" method of the SparkContext class, where the argument "v" is the value to be broadcast.

Apache PySpark RDD transformations are Spark operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are mostly immutable, a transformation always creates a new RDD rather than updating an existing one, which results in an RDD lineage. RDD lineage is also known as the RDD operator graph or the RDD dependency graph. RDD transformations are lazy operations: none of the transformations execute until an action is called by the user.


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what broadcast variables are and demonstrates their usage in PySpark.


Implementing the Broadcast variables in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use broadcast variables in PySpark.

# Implementing the Broadcast variables in Databricks in PySpark
spark = SparkSession.builder.appName('Broadcast variables PySpark').getOrCreate()
States = {"DL": "Delhi", "RJ": "Rajasthan", "KA": "Karnataka"}
broadcast_states = spark.sparkContext.broadcast(States)
sample_data = [("Ram", "Kapoor", "INDIA", "RJ"),
               ("Shyam", "Bose", "INDIA", "DL"),
               ("Amit", "Aggarwal", "INDIA", "RJ"),
               ("Rahul", "Gupta", "INDIA", "KA")]
sample_columns = ["firstname", "lastname", "country", "state"]
dataframe = spark.createDataFrame(data=sample_data, schema=sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)

# Using broadcast variables
def state_convert(code):
    # Look up the full state name in the broadcast dictionary
    return broadcast_states.value[code]

Final_result = dataframe.rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).toDF(sample_columns)
Final_result.show(truncate=False)

The Spark session is created. The commonly used data ("States") is defined in a dictionary and distributed as a broadcast variable using SparkContext.broadcast(). The "sample_data" and "sample_columns" are defined, and the DataFrame "dataframe" is created from them. The state_convert() function looks up each state code in the broadcast variable; mapping it over the rows of the DataFrame produces "Final_result".

