Explain the Accumulator in PySpark in Databricks

This recipe explains the Accumulator in PySpark in Databricks.

Recipe Objective - Explain the Accumulator in PySpark in Databricks

In PySpark, Accumulators are write-only, initialize-once shared variables: only the tasks running on the workers are allowed to update them, and the workers' updates are propagated automatically to the driver program. The PySpark Accumulator is a shared variable that is used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all the executors, which update and add information through aggregation or computational operations, while only the driver program is allowed to read an Accumulator variable using its value property. An Accumulator is created with the "accumulator()" function of the SparkContext class, i.e. "sparkContext.accumulator()", and Accumulators for custom types can be created using PySpark's AccumulatorParam class.
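
Accumulators for custom types are defined by subclassing AccumulatorParam and implementing its zero() and addInPlace() methods. The snippet below is a minimal sketch, assuming an active SparkSession named "spark"; the ListAccumulatorParam class and the sample data are illustrative, not part of this recipe:

# Defining a custom accumulator that collects values into a list
from pyspark import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    # zero() supplies the initial, empty value of the accumulator
    def zero(self, value):
        return []
    # addInPlace() merges a worker's partial result into the running value
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

list_accum = spark.sparkContext.accumulator([], ListAccumulatorParam())
spark.sparkContext.parallelize(["a", "b", "c"]).foreach(lambda x: list_accum.add([x]))
print(list_accum.value)  # e.g. ['a', 'b', 'c']; ordering is not guaranteed

Since worker tasks merge their partial results in no fixed order, the ordering of elements in list_accum.value is not deterministic.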

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what an Accumulator is and demonstrates its usage in PySpark.

Implementing the Accumulator in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession

The SparkSession is imported into the environment to use Accumulators in PySpark.

# Implementing the Accumulator in Databricks in PySpark
spark = SparkSession.builder.appName("Accumulator PySpark").getOrCreate()

# Summing all elements of an RDD with an accumulator
accum = spark.sparkContext.accumulator(0)
Rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd.foreach(lambda x: accum.add(x))
print(accum.value)  # 15

# Summing the elements with a named function instead of a lambda
accu_Sum = spark.sparkContext.accumulator(0)
def count_Fun(x):
    global accu_Sum
    accu_Sum += x
Rdd.foreach(count_Fun)
print(accu_Sum.value)  # 15

# Counting the elements of an RDD with an accumulator
accum_Count = spark.sparkContext.accumulator(0)
Rdd2 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Rdd2.foreach(lambda x: accum_Count.add(1))
print(accum_Count.value)  # 5

The Spark Session is defined. The accumulator variable "accum" is created using "spark.sparkContext.accumulator(0)" with an initial value of 0 of type int and is used to sum all the values in the RDD: the foreach() action iterates over the RDD and adds each element to "accum". The accumulated result is read through the "accum.value" property; "Rdd.foreach()" executes on the workers, while "accum.value" is called from the PySpark driver program. The "accu_Sum" accumulator repeats the same sum using the named function count_Fun() instead of a lambda. Finally, the "accum_Count" accumulator is created with an initial value of 0, Rdd2 is defined, each element increments the counter by 1, and the result is printed using the print() function.
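
The same pattern also applies to a DataFrame, since foreach() is available as a DataFrame action as well. The snippet below is a minimal sketch that reuses the SparkSession "spark" from above; the column name "value" and the sample rows are illustrative assumptions:

# Counting the rows of a DataFrame on the workers with an accumulator
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
row_count = spark.sparkContext.accumulator(0)
df.foreach(lambda row: row_count.add(1))
print(row_count.value)  # 3

Note that Spark guarantees each task's accumulator update is applied exactly once only inside actions such as foreach(); an update made inside a transformation such as map() may be applied more than once if a task is re-executed.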

