What is Spark RDD

In this tutorial, we shall learn about Spark RDDs, or Resilient Distributed Datasets, the main logical data units of Spark.

What is Spark RDD?

The main logical data units of Spark are RDDs, or Resilient Distributed Datasets. An RDD is a distributed collection of objects stored in memory or on the disks of various machines in a cluster. A single RDD can be divided into multiple logical partitions, which can then be stored and processed on different machines of the cluster. RDDs are immutable by design: an existing RDD cannot be changed, but you can build new RDDs by applying coarse-grained operations, called transformations, to it.
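For illustration, here is a minimal PySpark sketch (assuming a local SparkContext and small in-memory data) of building an RDD and deriving a new one through a transformation:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# Build an RDD from an in-memory collection; Spark splits it
# across the available partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations never modify the original RDD; they return a new one.
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())  # [1, 2, 3, 4, 5] -- the original is unchanged
print(doubled.collect())  # [2, 4, 6, 8, 10]
```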

An RDD in Spark can be cached and reused across subsequent computations, which is a significant advantage for users. RDDs are also lazily evaluated, meaning they postpone computation until it is absolutely necessary. This saves time and increases efficiency.
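As a brief sketch of caching (reusing the sc from the snippet above, with a hypothetical input file words.txt), calling cache keeps the RDD's data around so it is not recomputed for every action:

```python
# sc is the SparkContext created in the previous sketch;
# "words.txt" is a hypothetical input file.
words = sc.textFile("words.txt").flatMap(lambda line: line.split())
words.cache()

# The first action computes the RDD and stores it in memory ...
print(words.count())
# ... later actions reuse the cached data instead of re-reading the file.
print(words.distinct().count())
```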


Let us take a look at some of the key features of Spark RDDs:

Resilience
RDDs track data lineage information so that lost data can be rebuilt automatically in the event of a failure. This is also termed fault tolerance.
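A quick way to see the lineage Spark records (again reusing sc from the earlier sketch) is toDebugString, which prints the chain of transformations Spark would replay to rebuild lost partitions:

```python
# Each RDD remembers the transformations that produced it; Spark
# replays this lineage to recompute partitions lost to a failure.
base = sc.parallelize(range(100))
derived = base.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# toDebugString returns the recorded lineage graph (bytes in PySpark).
print(derived.toDebugString().decode("utf-8"))
```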

Distributed
Data in an RDD is partitioned and spread across multiple nodes of the cluster, so it can be stored and processed in parallel.

Lazy evaluation
Defining an RDD does not load its data. Transformations are actually computed only when you call an action, such as count or collect, or save the result to a file system.
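A small sketch of this behavior (reusing sc): the filter below is only recorded, and nothing runs until an action is called:

```python
# Nothing is computed here: filter only records the transformation.
logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Evaluation happens only when an action is invoked.
print(errors.count())    # action: triggers the computation -> 2
print(errors.collect())  # action: ['ERROR disk', 'ERROR net']
```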

Immutability
Data stored in an RDD is read-only; you cannot alter the data contained in the RDD. However, you can generate new RDDs by transforming existing RDDs.

In-memory computation
To provide swift access, an RDD keeps any intermediate data it generates in memory (i.e., RAM) rather than on disk.
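For example (a sketch reusing sc), persist lets you pin an RDD's computed data in RAM explicitly via a storage level:

```python
from pyspark import StorageLevel

# Keep the computed data in RAM so later actions can reuse it quickly.
squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_ONLY)

print(squares.sum())  # computed once, then kept in memory
print(squares.max())  # served from the in-memory copy
```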

Partitioning
Any existing RDD can be split into multiple logical partitions. A new partitioning is produced by applying transformations to the existing RDD, leaving the original untouched.
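Here is a minimal sketch (reusing sc) of inspecting and changing an RDD's partitioning; note that the original RDD keeps its own partitioning:

```python
# Ask for 4 partitions explicitly when creating the RDD.
data = sc.parallelize(range(12), numSlices=4)
print(data.getNumPartitions())  # 4

# repartition is a transformation: it yields a new RDD with a new
# partitioning and leaves the original untouched.
fewer = data.repartition(2)
print(fewer.getNumPartitions())  # 2
print(data.getNumPartitions())   # still 4
```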
