What is Spark RDD

In this tutorial, we shall learn about Spark RDDs, or Resilient Distributed Datasets, the main logical data units of Spark.

What is Spark RDD?

The main logical data units of Spark are RDDs, or Resilient Distributed Datasets. An RDD is a distributed collection of objects stored in memory or on the disks of various machines in a cluster. A single RDD can be divided into multiple logical partitions, which can then be stored and processed on different machines of the cluster. RDDs are immutable by design: an existing RDD cannot be changed, but you can build new RDDs by applying coarse-grained operations, called transformations, to it.
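For illustration, here is a minimal PySpark sketch (assuming a local SparkContext and small in-memory data) of building an RDD and deriving a new one through a transformation:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# Build an RDD from an in-memory collection; Spark splits it
# across the available partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations never modify the original RDD; they return a new one.
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())  # [1, 2, 3, 4, 5] -- the original is unchanged
print(doubled.collect())  # [2, 4, 6, 8, 10]
```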

An RDD in Spark can be cached and reused across subsequent computations, which is a significant advantage for users. RDDs are also lazily evaluated, meaning they postpone computation until it is absolutely necessary. This saves time and increases efficiency.
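As a brief sketch of caching (reusing the sc from the snippet above, with a hypothetical input file words.txt), calling cache keeps the RDD's data around so it is not recomputed for every action:

```python
# sc is the SparkContext created in the previous sketch;
# "words.txt" is a hypothetical input file.
words = sc.textFile("words.txt").flatMap(lambda line: line.split())
words.cache()

# The first action computes the RDD and stores it in memory ...
print(words.count())
# ... later actions reuse the cached data instead of re-reading the file.
print(words.distinct().count())
```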


Let us take a look at some of the key features of Spark RDDs:

Resilience
RDDs track data lineage information so that lost data can be rebuilt automatically in the event of a failure. This is also termed fault tolerance.
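A quick way to see the lineage Spark records (again reusing sc from the earlier sketch) is toDebugString, which prints the chain of transformations Spark would replay to rebuild lost partitions:

```python
# Each RDD remembers the transformations that produced it; Spark
# replays this lineage to recompute partitions lost to a failure.
base = sc.parallelize(range(100))
derived = base.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# toDebugString returns the recorded lineage graph (bytes in PySpark).
print(derived.toDebugString().decode("utf-8"))
```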

Distributed
Data in an RDD is partitioned and spread across multiple nodes of the cluster, so it can be stored and processed in parallel.

Lazy evaluation
Defining an RDD does not load its data. Transformations are actually computed only when you call an action, such as count or collect, or save the result to a file system.
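A small sketch of this behavior (reusing sc): the filter below is only recorded, and nothing runs until an action is called:

```python
# Nothing is computed here: filter only records the transformation.
logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Evaluation happens only when an action is invoked.
print(errors.count())    # action: triggers the computation -> 2
print(errors.collect())  # action: ['ERROR disk', 'ERROR net']
```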

Immutability
Data stored in an RDD is read-only; you cannot alter the data contained in the RDD. However, you can generate new RDDs by transforming existing RDDs.

In-memory computation
To provide swift access, an RDD keeps any intermediate data it generates in memory (i.e., RAM) rather than on disk.
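For example (a sketch reusing sc), persist lets you pin an RDD's computed data in RAM explicitly via a storage level:

```python
from pyspark import StorageLevel

# Keep the computed data in RAM so later actions can reuse it quickly.
squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_ONLY)

print(squares.sum())  # computed once, then kept in memory
print(squares.max())  # served from the in-memory copy
```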

Partitioning
Any existing RDD can be split into multiple logical partitions. A new partitioning is produced by applying transformations to the existing RDD, leaving the original untouched.
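Here is a minimal sketch (reusing sc) of inspecting and changing an RDD's partitioning; note that the original RDD keeps its own partitioning:

```python
# Ask for 4 partitions explicitly when creating the RDD.
data = sc.parallelize(range(12), numSlices=4)
print(data.getNumPartitions())  # 4

# repartition is a transformation: it yields a new RDD with a new
# partitioning and leaves the original untouched.
fewer = data.repartition(2)
print(fewer.getNumPartitions())  # 2
print(data.getNumPartitions())   # still 4
```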
