Explain the conversion of DataFrame columns to MapType in PySpark

This recipe gives a detailed overview of how the create_map() function in Apache Spark is used to convert DataFrame columns to MapType in PySpark on Databricks, and demonstrates the function with an implementation example in Python.

Recipe Objective - Explain the conversion of DataFrame columns to MapType in PySpark in Databricks

The create_map() function in Apache Spark is commonly used to convert selected DataFrame columns, or all of them, to MapType, which is similar to a Python dictionary (dict). The function takes as input a list of columns grouped into key-value pairs (key1, value1, key2, value2, key3, value3, ...) and returns a MapType column. create_map() is a PySpark SQL function imported from "pyspark.sql.functions".
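As a brief sketch of the call pattern (the DataFrame df and the columns "a" and "b" below are illustrative placeholders, not part of this recipe's example), keys are typically passed as lit() literals and values as col() references:

from pyspark.sql.functions import create_map, lit, col

# Hypothetical usage: pack the existing columns "a" and "b" into a single
# MapType column keyed by their own column names.
df_with_map = df.withColumn("props", create_map(lit("a"), col("a"),
                                                lit("b"), col("b")))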

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains the create_map() function and shows how to use it in PySpark.

Implementing the conversion of DataFrame columns to MapType in Databricks in PySpark

# Importing package
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, create_map

The SparkSession, StructType, StructField, StringType, IntegerType, col, lit, and create_map imports bring into the environment everything needed to convert DataFrame columns to MapType in PySpark.

# Implementing the conversion of DataFrame columns to MapType in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark create_map()').getOrCreate()

Sample_data = [("38874", "Technology", 5000, "IND"),
               ("42105", "Technology", 6000, "BHU"),
               ("46987", "Finance", 4900, "IND"),
               ("35412", "Entertainment", 3500, "ISR"),
               ("36987", "Finance", 5500, "IND")]

Sample_schema = StructType([
    StructField('id', StringType(), True),
    StructField('dept', StringType(), True),
    StructField('salary', IntegerType(), True),
    StructField('location', StringType(), True)
])

dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)

# Convert the salary and location columns to a single MapType column
dataframe = dataframe.withColumn("PropertiesOnMap", create_map(
    lit("salary"), col("salary"),
    lit("location"), col("location")
)).drop("salary", "location")
dataframe.printSchema()
dataframe.show(truncate=False)
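
Note that create_map() requires all the map values to share a single type. Because salary is an IntegerType while location is a StringType, Spark widens both values to string, so the PropertiesOnMap column comes out as map<string,string>. The second printSchema() call should therefore print roughly the following (exact nullability flags may vary):

root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- PropertiesOnMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)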

The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The create_map() PySpark SQL function returns the converted DataFrame columns salary and location to the MapType.
