How to do performance tuning in Spark?

In this tutorial, we will go through some performance optimization techniques that help you process data and solve complex problems faster in Spark.

Spark performance tuning is the process of adjusting and optimizing system resources (CPU cores and memory), tuning various configuration parameters, and following framework principles and best practices to improve the performance of Spark and PySpark applications. Performance matters just as much during development as it does in production. Many strategies can be used to optimize a Spark job, so let's look at them one by one.

1) Serialization

Any distributed application's performance is heavily influenced by serialization. Spark uses the Java serializer by default, but you can switch to the faster and more compact Kryo serializer via the spark.serializer setting. Most Spark workloads run as a pipeline: one Spark job writes data to a file, another Spark job reads that data, processes it, and writes it to yet another file for the next job to consume. For use cases like this, write the intermediate files in an optimized binary format such as Avro or Parquet, because transformations over these formats perform far better than over text, CSV, or JSON.
E.g., using Apache Avro –
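Below is a minimal sketch, assuming the input/output paths and the spark-avro package coordinates shown in the comments; it also switches to the Kryo serializer mentioned above.

// Requires the external spark-avro module, e.g.
//   spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.2 ...
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AvroIntermediateFiles")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo instead of the default Java serializer
  .getOrCreate()

// Hypothetical input: a CSV file produced by an upstream job
val df = spark.read.option("header", "true").csv("/tmp/input.csv")

// Write the intermediate result as Avro, a compact, splittable binary format
df.write.format("avro").save("/tmp/intermediate_avro")

// The next job in the pipeline reads it back much more cheaply than CSV or JSON
val next = spark.read.format("avro").load("/tmp/intermediate_avro")
next.show()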


2) Using DataFrame/Dataset over RDD

RDD, DataFrame, and Dataset are the three kinds of APIs available in Spark. RDD is the low-level API and offers few optimization opportunities. In most circumstances, DataFrame is the best option because it uses the Catalyst optimizer, which generates an optimized query plan, and it carries lower memory and garbage-collection overhead. Datasets are strongly type-safe and use encoders for serialization; they also rely on Tungsten as a binary serializer.
Using RDDs directly causes performance problems because Spark cannot apply these optimization techniques to them, and an RDD serializes and deserializes its data whenever it is distributed across the cluster (repartitioning and shuffling). For Spark applications, or any distributed system, serialization and deserialization are relatively expensive operations; when most of the time goes into serializing data rather than doing useful work, performance suffers, which is why we try to avoid working with RDDs directly.
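To make the difference concrete, here is a minimal sketch (the Sale case class, the sample values, and the column names are assumptions for illustration) that computes the same aggregation first with the RDD API and then with the DataFrame/Dataset API, where Catalyst and Tungsten can optimize the plan.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical record type; Datasets generate an encoder for it
case class Sale(product: String, amount: Double)

val spark = SparkSession.builder().appName("DataFrameOverRDD").getOrCreate()
import spark.implicits._

val sales = Seq(Sale("a", 10.0), Sale("a", 5.0), Sale("b", 7.5)).toDS()

// RDD style: opaque lambdas, no Catalyst optimization, generic object serialization
val rddTotals = sales.rdd.map(s => (s.product, s.amount)).reduceByKey(_ + _)
rddTotals.collect().foreach(println)

// DataFrame/Dataset style: Catalyst builds an optimized plan, Tungsten manages the binary layout
val dfTotals = sales.groupBy($"product").agg(sum($"amount").as("total"))
dfTotals.show()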

3) Caching and Persisting data

Persisting/caching in Spark is one of the most effective ways to boost the performance of Spark workloads. Spark provides an optimization technique to store the intermediate computation of a Spark DataFrame using the cache() and persist() methods so that it can be reused in subsequent actions. When you persist a dataset, each node saves its partitions in memory (or on disk, depending on the storage level) and reuses them in subsequent operations on that dataset. Spark's persisted data is fault-tolerant: if a partition of a Dataset is lost, it is automatically recomputed using the original transformations that created it. Cached DataFrame/SQL data is stored in an in-memory columnar format, and you can optimize it further by tweaking the spark.sql.inMemoryColumnarStorage.batchSize property. Spark provides multiple storage levels for cached data; choose the one that best matches your cluster.
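A minimal sketch, assuming a hypothetical /tmp/events Parquet input with an eventType column, of persisting a DataFrame that several actions reuse:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CachingExample").getOrCreate()

val events = spark.read.parquet("/tmp/events") // hypothetical input path

// cache() uses the default storage level; persist() lets you pick one that fits your cluster
val cached = events.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(cached.count())                    // the first action materializes the cache
cached.groupBy("eventType").count().show() // later actions reuse the cached partitions

cached.unpersist() // release the memory once the data is no longer needed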

4) Reducing expensive shuffling operations

Spark uses a process called shuffling to redistribute data across executors and machines. Shuffling occurs when we apply certain wide transformations to an RDD or DataFrame, such as groupByKey(), reduceByKey(), and join(). A Spark shuffle is a costly procedure because it entails the following:
• Disk and Network I/O
• Data serialization and deserialization
A user can run into out-of-memory errors when there is a lot of shuffling; increasing the level of parallelism helps avoid them. We cannot prevent shuffle operations entirely, but we can try to reduce their number and eliminate any that are unnecessary. Spark exposes the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions, and tuning this property can noticeably improve performance, as shown in the sketch below.
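A minimal sketch of both tunings, assuming hypothetical input paths, column names, and partition counts:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("ShuffleTuning")
  // spark.sql.shuffle.partitions sets how many partitions DataFrame/SQL shuffles
  // (joins, aggregations) produce; the default is 200
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()

// DataFrame aggregation: the groupBy triggers a shuffle into 64 partitions
val orders = spark.read.parquet("/tmp/orders") // hypothetical input
orders.groupBy("customerId").agg(sum("amount").as("total")).show()

// On the RDD side, prefer reduceByKey over groupByKey: values are combined on the
// map side first, so far less data crosses the network during the shuffle
val words = spark.sparkContext.textFile("/tmp/words.txt") // hypothetical input
val counts = words.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _, 64)
counts.take(10).foreach(println)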
