What is Spark RDD

In this tutorial, we shall learn about Spark RDDs (Resilient Distributed Datasets), the main logical data units of Spark.

What is Spark RDD?

RDDs, or Resilient Distributed Datasets, are the main logical data units of Spark. An RDD is a distributed collection of objects stored in memory or on the disks of different machines in a cluster. A single RDD can be divided into multiple logical partitions, which can then be stored and processed on different cluster machines. RDDs are immutable by design: an existing RDD cannot be changed, but you can build new RDDs by applying coarse-grained operations, such as transformations, to it.

An RDD in Spark can be cached and reused across subsequent computations, which is a significant advantage for users. RDDs are lazily evaluated, which means they postpone computation until it is absolutely necessary. This saves time and increases efficiency.


Let us take a look at some of the key features of Spark RDD:

Resilience
RDDs track data lineage information so that lost data can be rebuilt automatically in the event of a failure. This is also termed fault tolerance.

Distributed
Data in an RDD is partitioned and distributed across multiple nodes of the cluster, so it can be processed in parallel.

Lazy evaluation
Defining an RDD does not load its data. Transformations are actually computed only when you call an action, such as count or collect, or save the result to a file system.

Immutability
Data stored in an RDD is read-only; you cannot alter it. However, you can generate new RDDs by transforming existing ones.

In-memory computation
To provide fast access, an RDD keeps any intermediate data it generates in memory (RAM) rather than on disk.

Partitioning
Any existing RDD can be split into immutable logical partitions. This is accomplished by applying transformations to the existing partitions.

