Explain conversion of DataFrame columns to MapType in PySpark

This recipe gives a detailed overview of how the create_map() function in Apache Spark is used to convert DataFrame columns to MapType in PySpark on Databricks, and demonstrates the function with an example in Python.

Recipe Objective - Explain the conversion of DataFrame columns to MapType in PySpark in Databricks

The create_map() function in Apache Spark is commonly used to convert selected DataFrame columns (or all of them) to MapType, which is similar to a Python dictionary (dict) object. The function takes a list of columns grouped as key-value pairs (key1, value1, key2, value2, key3, value3, …) and returns a MapType column. create_map() is a PySpark SQL function imported from "pyspark.sql.functions".
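As a quick sketch of the call pattern (the DataFrame df and its emp_id and emp_name columns here are hypothetical, not part of the recipe), keys are typically passed as lit() literals and values as col() references, alternating key, value, key, value:

# Minimal sketch of create_map() usage (hypothetical df with emp_id and emp_name columns)
from pyspark.sql.functions import create_map, lit, col

df_with_map = df.withColumn("props", create_map(
    lit("emp_id"), col("emp_id"),       # key "emp_id" -> value from the emp_id column
    lit("emp_name"), col("emp_name")    # key "emp_name" -> value from the emp_name column
))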

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains the create_map() function and how to use it in PySpark.

Implementing the conversion of DataFrame columns to MapType in Databricks in PySpark

# Importing package
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, create_map

The SparkSession, StructType, StructField, StringType, IntegerType, col, lit, and create_map imports are brought into the environment to perform the conversion of DataFrame columns to MapType in PySpark.

# Implementing the conversion of DataFrame columns to MapType in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark create_map()').getOrCreate()

Sample_data = [("38874", "Technology", 5000, "IND"),
               ("42105", "Technology", 6000, "BHU"),
               ("46987", "Finance", 4900, "IND"),
               ("35412", "Entertainment", 3500, "ISR"),
               ("36987", "Finance", 5500, "IND")]

Sample_schema = StructType([
    StructField('id', StringType(), True),
    StructField('dept', StringType(), True),
    StructField('salary', IntegerType(), True),
    StructField('location', StringType(), True)
])

dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)

# Convert the salary and location columns to a single MapType column
dataframe = dataframe.withColumn("PropertiesOnMap", create_map(
    lit("salary"), col("salary"),
    lit("location"), col("location")
)).drop("salary", "location")

dataframe.printSchema()
dataframe.show(truncate=False)

The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The create_map() PySpark SQL function returns the converted DataFrame columns salary and location to the MapType.

