Explain the map transformation in PySpark in Databricks

This recipe explains the map() transformation in PySpark in Databricks.

Recipe Objective - Explain the map() transformation in PySpark in Databricks

In PySpark, map() is an RDD transformation that applies a transformation function (typically a lambda) to every element of a Resilient Distributed Dataset (RDD) and returns a new RDD. The map() transformation can apply any per-record operation, such as adding a column, updating a column, or otherwise transforming the data, and its output always has the same number of records as its input. A DataFrame does not have a map() transformation, so the DataFrame must first be converted to an RDD. Further, if heavy initialization is required, the PySpark mapPartitions() transformation is recommended instead of map(), because with mapPartitions() the heavy initialization executes only once per partition rather than once per record.
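As a minimal sketch (not part of the original recipe), the example below assumes a local SparkSession and uses a simple string prefix as a stand-in for expensive per-partition setup, such as opening a database connection:

# Minimal sketch: with mapPartitions() the setup runs once per partition,
# whereas with map() it would run once per record.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("mapPartitions sketch").getOrCreate()

def transform_partition(records):
    prefix = "word-"              # stand-in for heavy initialization, done once per partition
    for record in records:
        yield prefix + record     # per-record logic, the same work map() would apply

Rdd_out = spark.sparkContext.parallelize(["a", "b", "c"], 2).mapPartitions(transform_partition)
print(Rdd_out.collect())          # ['word-a', 'word-b', 'word-c']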


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the map() transformation is and demonstrates its usage in PySpark.


Implementing the map() transformation in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StructType, StructField, StringType
from pyspark.sql.functions import col

SparkSession, Row, MapType, StringType, col, StructType, and StructField are imported into the environment to use the map() transformation in PySpark.

# Implementing the map() transformation in Databricks in PySpark
spark = SparkSession.builder.master("local[1]") \
    .appName("map() PySpark").getOrCreate()
Sample_data = ["Project", "Narmada", "Gandhi", "Adventures",
               "in", "Gujarat", "Project", "Narmada", "Adventures",
               "in", "Gujarat", "Project", "Narmada"]
Rdd = spark.sparkContext.parallelize(Sample_data)
# Using map() transformation to pair each word with the count 1
Rdd2 = Rdd.map(lambda x: (x, 1))
for element in Rdd2.collect():
    print(element)
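Each element collected from Rdd2 is a (word, 1) tuple, so the loop prints one pair per input word, beginning with:

('Project', 1)
('Narmada', 1)
('Gandhi', 1)
('Adventures', 1)

and continuing in the same pattern for the remaining words.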

The Spark session is created, "Sample_data" is defined, and "Rdd" is built by parallelizing Sample_data. The map() transformation then pairs each element with the value 1, so the resulting RDD exposes PairRDDFunctions and contains key-value pairs: the word (a String) as the key and 1 (an Int) as the value. Because a PySpark DataFrame does not have a map() transformation for applying a lambda function, a custom transformation of this kind requires converting the DataFrame to a Resilient Distributed Dataset first and then applying map().
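As a brief sketch of that conversion (the DataFrame, its column names, and the sample rows below are illustrative assumptions, reusing the spark session created above):

# Sketch: a DataFrame has no map(), so transform through df.rdd and rebuild a DataFrame
Sample_rows = [("Sardar", "Patel"), ("Mahatma", "Gandhi")]
df = spark.createDataFrame(Sample_rows, ["firstname", "lastname"])
# Apply map() on the underlying RDD of Row objects
Rdd_mapped = df.rdd.map(lambda row: (row["firstname"] + " " + row["lastname"],))
# Convert the mapped RDD back to a DataFrame with a new column name
df_mapped = Rdd_mapped.toDF(["fullname"])
df_mapped.show()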

