Explain the Accumulator in PySpark in Databricks

This recipe explains the Accumulator in PySpark in Databricks.

Recipe Objective - Explain the Accumulator in PySpark in Databricks

In PySpark, Accumulators are write-only, initialize-once shared variables: only the tasks running on the workers are allowed to update them, and the workers' updates are propagated automatically to the driver program. The PySpark Accumulator is a shared variable that is used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all the executors, which update and add information through aggregation or computational operations, while only the driver program is allowed to read an Accumulator variable using its value property. An Accumulator is created with the "accumulator()" function of the SparkContext class, i.e. "sparkContext.accumulator()", and Accumulators for custom types can be created using PySpark's AccumulatorParam class.
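
Accumulators for custom types are defined by subclassing AccumulatorParam and implementing its zero() and addInPlace() methods. The snippet below is a minimal sketch, assuming an active SparkSession named "spark"; the ListAccumulatorParam class and the sample data are illustrative, not part of this recipe:

# Defining a custom accumulator that collects values into a list
from pyspark import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    # zero() supplies the initial, empty value of the accumulator
    def zero(self, value):
        return []
    # addInPlace() merges a worker's partial result into the running value
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

list_accum = spark.sparkContext.accumulator([], ListAccumulatorParam())
spark.sparkContext.parallelize(["a", "b", "c"]).foreach(lambda x: list_accum.add([x]))
print(list_accum.value)  # e.g. ['a', 'b', 'c']; ordering is not guaranteed

Since worker tasks merge their partial results in no fixed order, the ordering of elements in list_accum.value is not deterministic.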

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what an Accumulator is and demonstrates its usage in PySpark.

Implementing the Accumulator in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use Accumulators in PySpark.

# Implementing the Accumulator in Databricks in PySpark
spark = SparkSession.builder.appName("Accumulator PySpark").getOrCreate()

# Summing all elements of an RDD with an accumulator
accum = spark.sparkContext.accumulator(0)
Rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd.foreach(lambda x: accum.add(x))
print(accum.value)  # 15

# Summing the elements with a named function instead of a lambda
accu_Sum = spark.sparkContext.accumulator(0)
def count_Fun(x):
    global accu_Sum
    accu_Sum += x
Rdd.foreach(count_Fun)
print(accu_Sum.value)  # 15

# Counting the elements of an RDD with an accumulator
accum_Count = spark.sparkContext.accumulator(0)
Rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd2.foreach(lambda x: accum_Count.add(1))
print(accum_Count.value)  # 5

The Spark Session is defined. The accumulator variable "accum" is created using "spark.sparkContext.accumulator(0)" with an initial value of 0 of type int and is used to sum all the values in the RDD: the foreach() action iterates over the RDD and adds each element to "accum". The accumulated result is read through the "accum.value" property; "Rdd.foreach()" executes on the workers, while "accum.value" is called from the PySpark driver program. The "accu_Sum" accumulator repeats the same sum using the named function count_Fun() instead of a lambda. Finally, the "accum_Count" accumulator is created with an initial value of 0, Rdd2 is defined, each element increments the counter by 1, and the result is printed using the print() function.
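
The same pattern also applies to a DataFrame, since foreach() is available as a DataFrame action as well. The snippet below is a minimal sketch that reuses the SparkSession "spark" from above; the column name "value" and the sample rows are illustrative assumptions:

# Counting the rows of a DataFrame on the workers with an accumulator
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
row_count = spark.sparkContext.accumulator(0)
df.foreach(lambda row: row_count.add(1))
print(row_count.value)  # 3

Note that Spark guarantees each task's accumulator update is applied exactly once only inside actions such as foreach(); an update made inside a transformation such as map() may be applied more than once if a task is re-executed.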

