How to do performance tuning in Spark

In this tutorial, we will walk through several performance optimization techniques that help Spark process data and solve complex problems faster.


Spark performance tuning is the process of adjusting system resources (CPU cores and memory), tuning configuration parameters, and following framework principles and best practices to improve the performance of Spark and PySpark applications. Performance matters as much as correctness in any program. Many strategies can be used to optimize a Spark job, so let's look at them one by one.


1) Serialization

Serialization heavily influences the performance of any distributed application. Spark uses the Java serializer by default; switching to the Kryo serializer (via the spark.serializer setting) is generally faster and more compact. Most Spark workloads run as pipelines: one Spark job writes data to a file, another job reads that data, processes it, and writes it to another file for the next job to read. For such intermediate files, use a serialized, optimized file format like Avro or Parquet, because transformations on these formats perform much better than on text, CSV, or JSON.


2) Using DataFrame/Dataset over RDD

RDD, DataFrame, and Dataset are the three API levels available in Spark. RDD is the low-level API with few optimization opportunities. In most circumstances, DataFrame is the best option, since it goes through the Catalyst optimizer, which generates an optimized query plan, and it carries a lower garbage-collection overhead. Datasets are strongly type-safe, use encoders for their serialization, and rely on Tungsten as the binary serializer.
Using RDDs directly causes performance problems because Spark cannot apply its optimization techniques to them, and RDDs serialize and deserialize their data whenever it is distributed across the cluster (repartitioning and shuffling). Serialization and deserialization are expensive operations in any distributed system; when most of the time goes into serializing data rather than executing computations, throughput suffers, so we avoid RDDs where the higher-level APIs suffice.

3) Caching and Persisting data

Persisting/caching in Spark is one of the most effective ways to boost the performance of Spark workloads. Spark can store the intermediate computation of a DataFrame via the cache() and persist() methods so that it can be reused in subsequent actions. When you persist a dataset, each node saves its partitions of the data in memory and reuses them in later operations on that dataset. Persisted data is fault-tolerant: if a partition is lost, Spark recomputes it from the original transformations that produced it. Cached DataFrames are stored in an in-memory columnar format, which you can tune further through the spark.sql.inMemoryColumnarStorage.batchSize property. Spark offers multiple storage levels for cached data; choose the one that best matches your cluster.

4) Reducing expensive shuffling operations

Spark uses a process called shuffling to redistribute data among executors and even machines. Shuffles occur when we apply certain transformations to RDDs and DataFrames, such as groupByKey(), reduceByKey(), and join(). A shuffle is a costly procedure because it entails:
• Disk and network I/O
• Data serialization and deserialization
Heavy shuffling can produce out-of-memory errors; increasing the level of parallelism helps avoid them. We cannot prevent shuffle operations entirely, but we can reduce how often they happen and eliminate any that are unnecessary. Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; tuning this setting can noticeably improve performance.

