Explain the Accumulator in PySpark in Databricks

This recipe explains what the Accumulator is in PySpark in Databricks.

Recipe Objective - Explain the Accumulator in PySpark in Databricks

In PySpark, Accumulators are write-only, initialize-once shared variables: only the tasks running on the workers are allowed to update them, and updates from the workers are propagated automatically to the driver program. The PySpark Accumulator is a shared variable used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all the executors, which update them through aggregation operations that are associative and commutative. Only the driver program is allowed to read an Accumulator, using its value property. An Accumulator is created using the "accumulator()" function of the SparkContext class, i.e., "sparkContext.accumulator()", and Accumulators for custom types can be created using the AccumulatorParam class of PySpark.
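For custom types, a subclass of AccumulatorParam supplies the zero value and the merge operation. Below is a minimal sketch of a list-collecting accumulator; the ListParam class and the "CustomAccumulator" app name are illustrative, not part of the recipe.

from pyspark.sql import SparkSession
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    # zero() supplies the empty starting value for each task's local copy
    def zero(self, value):
        return []
    # addInPlace() merges two partial results (or folds in a single update)
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

spark = SparkSession.builder.appName("CustomAccumulator").getOrCreate()
seen = spark.sparkContext.accumulator([], ListParam())
spark.sparkContext.parallelize([1, 2, 3]).foreach(lambda x: seen.add([x]))
print(seen.value)  # e.g. [1, 2, 3] (ordering across partitions is not guaranteed)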

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what an Accumulator is and demonstrates its usage in PySpark.

Implementing the Accumulator in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use the Accumulator in PySpark.

# Implementing the Accumulator in Databricks in PySpark
spark = SparkSession.builder.appName("Accumulator PySpark").getOrCreate()

# Sum all elements of the RDD into an accumulator using a lambda
accum = spark.sparkContext.accumulator(0)
Rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd.foreach(lambda x: accum.add(x))
print(accum.value)  # 15

# Sum the elements again, this time through a named function
accu_Sum = spark.sparkContext.accumulator(0)
def count_Fun(x):
    global accu_Sum
    accu_Sum += x
Rdd.foreach(count_Fun)
print(accu_Sum.value)  # 15

# Count the elements by adding 1 per element
accum_Count = spark.sparkContext.accumulator(0)
Rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd2.foreach(lambda x: accum_Count.add(1))
print(accum_Count.value)  # 5

The Spark Session is defined. The accumulator variable "accum" is created using "spark.sparkContext.accumulator(0)" with an initial value of 0 of type int and is used to sum all values in the RDD. Each element of the RDD is iterated using the foreach() action and added to the "accum" variable, so "accum.value" yields 15. The "Rdd.foreach()" call is executed on the workers, while "accum.value" is read from the PySpark driver program. The same sum is then computed through the named function count_Fun(), which updates the "accu_Sum" accumulator. Finally, the "accum_Count" accumulator is created with an initial value of 0, Rdd2 is defined, and foreach() adds 1 per element, so the printed count is 5.
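One caveat worth knowing: accumulator updates inside actions such as foreach() are applied exactly once per task, but updates made inside transformations can be applied more than once if the RDD is re-evaluated. A minimal sketch of this behavior, assuming the spark session and Rdd defined above (the names double_count and mapped are illustrative):

# Updates inside a transformation such as map() run every time the RDD
# is evaluated, so re-running an action on an uncached RDD double-counts.
double_count = spark.sparkContext.accumulator(0)
mapped = Rdd.map(lambda x: double_count.add(1) or x)
mapped.count()  # first evaluation adds 5
mapped.count()  # re-evaluation adds 5 more
print(double_count.value)  # 10, not 5 -- prefer foreach() for reliable updates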

