Explain the Accumulator in PySpark in Databricks

This recipe explains what the Accumulator is in PySpark in Databricks.

Recipe Objective - Explain the Accumulator in PySpark in Databricks

In PySpark, Accumulators are write-only, initialize-once shared variables: only the tasks running on the workers are allowed to update them, and updates from the workers are propagated automatically to the driver program. The PySpark Accumulator is a shared variable used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all the executors, which update them through aggregation operations that are associative and commutative. Only the driver program is allowed to read an Accumulator, using its value property. An Accumulator is created using the "accumulator()" function of the SparkContext class, i.e., "sparkContext.accumulator()", and Accumulators for custom types can be created using the AccumulatorParam class of PySpark.
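For custom types, a subclass of AccumulatorParam supplies the zero value and the merge operation. Below is a minimal sketch of a list-collecting accumulator; the ListParam class and the "CustomAccumulator" app name are illustrative, not part of the recipe.

from pyspark.sql import SparkSession
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    # zero() supplies the empty starting value for each task's local copy
    def zero(self, value):
        return []
    # addInPlace() merges two partial results (or folds in a single update)
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

spark = SparkSession.builder.appName("CustomAccumulator").getOrCreate()
seen = spark.sparkContext.accumulator([], ListParam())
spark.sparkContext.parallelize([1, 2, 3]).foreach(lambda x: seen.add([x]))
print(seen.value)  # e.g. [1, 2, 3] (ordering across partitions is not guaranteed)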

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what an Accumulator is and demonstrates its usage in PySpark.

Implementing the Accumulator in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use the Accumulator in PySpark.

# Implementing the Accumulator in Databricks in PySpark
spark = SparkSession.builder.appName("Accumulator PySpark").getOrCreate()

# Sum all elements of the RDD into an accumulator using a lambda
accum = spark.sparkContext.accumulator(0)
Rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd.foreach(lambda x: accum.add(x))
print(accum.value)  # 15

# Sum the elements again, this time through a named function
accu_Sum = spark.sparkContext.accumulator(0)
def count_Fun(x):
    global accu_Sum
    accu_Sum += x
Rdd.foreach(count_Fun)
print(accu_Sum.value)  # 15

# Count the elements by adding 1 per element
accum_Count = spark.sparkContext.accumulator(0)
Rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd2.foreach(lambda x: accum_Count.add(1))
print(accum_Count.value)  # 5

The Spark Session is defined. The accumulator variable "accum" is created using "spark.sparkContext.accumulator(0)" with an initial value of 0 of type int and is used to sum all values in the RDD. Each element of the RDD is iterated using the foreach() action and added to the "accum" variable, so "accum.value" yields 15. The "Rdd.foreach()" call is executed on the workers, while "accum.value" is read from the PySpark driver program. The same sum is then computed through the named function count_Fun(), which updates the "accu_Sum" accumulator. Finally, the "accum_Count" accumulator is created with an initial value of 0, Rdd2 is defined, and foreach() adds 1 per element, so the printed count is 5.
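One caveat worth knowing: accumulator updates inside actions such as foreach() are applied exactly once per task, but updates made inside transformations can be applied more than once if the RDD is re-evaluated. A minimal sketch of this behavior, assuming the spark session and Rdd defined above (the names double_count and mapped are illustrative):

# Updates inside a transformation such as map() run every time the RDD
# is evaluated, so re-running an action on an uncached RDD double-counts.
double_count = spark.sparkContext.accumulator(0)
mapped = Rdd.map(lambda x: double_count.add(1) or x)
mapped.count()  # first evaluation adds 5
mapped.count()  # re-evaluation adds 5 more
print(double_count.value)  # 10, not 5 -- prefer foreach() for reliable updates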

