How to do performance tuning in Spark

In this tutorial, we will walk through several performance optimization techniques that help Spark process data and solve complex problems faster.


Spark performance tuning is the process of adjusting system resources (CPU cores and memory), tuning configuration parameters, and following framework principles and best practices to improve the performance of Spark and PySpark applications. Performance matters as much as correctness in any program. Many strategies can be used to optimize a Spark job, so let's look at them one by one.


1) Serialization

Serialization heavily influences the performance of any distributed application. Spark uses the Java serializer by default; switching to the Kryo serializer (via the spark.serializer setting) is generally faster and more compact. Most Spark workloads run as pipelines: one Spark job writes data to a file, another job reads that data, processes it, and writes it to another file for the next job to read. For such intermediate files, use a serialized, optimized file format like Avro or Parquet, because transformations on these formats perform much better than on text, CSV, or JSON.


2) Using DataFrame/Dataset over RDD

RDD, DataFrame, and Dataset are the three API levels available in Spark. RDD is the low-level API with few optimization opportunities. In most circumstances, DataFrame is the best option, since it goes through the Catalyst optimizer, which generates an optimized query plan, and it carries a lower garbage-collection overhead. Datasets are strongly type-safe, use encoders for their serialization, and rely on Tungsten as the binary serializer.
Using RDDs directly causes performance problems because Spark cannot apply its optimization techniques to them, and RDDs serialize and deserialize their data whenever it is distributed across the cluster (repartitioning and shuffling). Serialization and deserialization are expensive operations in any distributed system; when most of the time goes into serializing data rather than executing computations, throughput suffers, so we avoid RDDs where the higher-level APIs suffice.

3) Caching and Persisting data

Persisting/caching in Spark is one of the most effective ways to boost the performance of Spark workloads. Spark can store the intermediate computation of a DataFrame via the cache() and persist() methods so that it can be reused in subsequent actions. When you persist a dataset, each node saves its partitions of the data in memory and reuses them in later operations on that dataset. Persisted data is fault-tolerant: if a partition is lost, Spark recomputes it from the original transformations that produced it. Cached DataFrames are stored in an in-memory columnar format, which you can tune further through the spark.sql.inMemoryColumnarStorage.batchSize property. Spark offers multiple storage levels for cached data; choose the one that best matches your cluster.

4) Reducing expensive shuffling operations

Spark uses a process called shuffling to redistribute data among executors and even machines. Shuffles occur when we apply certain transformations to RDDs and DataFrames, such as groupByKey(), reduceByKey(), and join(). A shuffle is a costly procedure because it entails:
• Disk and network I/O
• Data serialization and deserialization
Heavy shuffling can produce out-of-memory errors; increasing the level of parallelism helps avoid them. We cannot prevent shuffle operations entirely, but we can reduce how often they happen and eliminate any that are unnecessary. Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; tuning this setting can noticeably improve performance.

