How to use Spark Parallelize

In this tutorial, we will learn how to use Spark parallelize: specifically, how to create RDDs with the parallelize() method and how to create an empty RDD using PySpark.

How to use Spark Parallelize?

PySpark's parallelize() is a SparkContext method that distributes a local Python collection, such as a list, to form an RDD.

Before we begin, let us understand what RDDs are. Resilient Distributed Datasets (RDDs) are the core data structure in PySpark: immutable, distributed collections of objects. Each RDD is divided into logical partitions, which can be computed in parallel on different nodes of the cluster.
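
To make the immutability point concrete, here is a minimal, self-contained sketch (the app name and variable names are illustrative): transformations such as map() never modify an existing RDD; they always return a new one.

from pyspark.sql import SparkSession

# Minimal illustration of RDD immutability; the app name is arbitrary.
spark = SparkSession.builder.appName('RDDBasics').getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3])        # an RDD with 3 elements
doubled = numbers.map(lambda x: x * 2)     # map() returns a NEW RDD

print(numbers.collect())   # [1, 2, 3] -- the original RDD is unchanged
print(doubled.collect())   # [2, 4, 6]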

Let us now use PySpark to parallelize an existing collection in your driver program.

Here is an example of how to create an RDD with SparkContext's parallelize() method.

sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8])

Let us now use sparkContext.parallelize in a Spark application:

Code:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and get its SparkContext
spark = SparkSession.builder.appName('ParallelizeExample').getOrCreate()
sparkContext = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD
rdd = sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])

# collect() is an action that returns the RDD's elements to the driver
rddCollect = rdd.collect()
print("Number of Partitions: " + str(rdd.getNumPartitions()))
print("Action: First element: " + str(rdd.first()))
print(rddCollect)

Output:
Number of Partitions: 2
Action: First element: 1
[1, 2, 3, 4, 5, 6, 7]
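
Note that the partition count depends on your environment: when running locally, parallelize() typically defaults to the number of cores available to Spark, which is why the output above shows 2. If you need explicit control, parallelize() accepts an optional second argument, numSlices. Here is a minimal sketch (reusing the sparkContext from the example above; variable names are illustrative, and the exact glom() output may vary):

rdd4 = sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7], numSlices=4)
print(rdd4.getNumPartitions())   # 4
print(rdd4.glom().collect())     # e.g. [[1], [2, 3], [4, 5], [6, 7]]

glom() groups the elements of each partition into a list, which makes it easy to see how the data was actually split.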


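How to create an empty RDD using PySpark?

As promised in the introduction, PySpark also lets you create an empty RDD. Here is a minimal sketch (reusing the sparkContext from the example above; variable names are illustrative): sparkContext.emptyRDD() creates an RDD with no elements and no partitions, while parallelize() with an empty list lets you specify a partition count.

Code:
emptyRDD = sparkContext.emptyRDD()
print(emptyRDD.isEmpty())               # True
print(emptyRDD.getNumPartitions())      # 0

emptyWithPartitions = sparkContext.parallelize([], 10)
print(emptyWithPartitions.isEmpty())            # True
print(emptyWithPartitions.getNumPartitions())   # 10

An empty RDD with a fixed number of partitions is occasionally useful as a starting point for unions or for testing partition-dependent logic.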