Explain the map_values() and map_keys() functions in PySpark in Databricks

This recipe explains the map_values() and map_keys() functions in PySpark in Databricks

Recipe Objective - Explain the map_values() and map_keys() functions in PySpark in Databricks

The PySpark MapType (map type) in Apache Spark is a data type used to represent a Python dictionary (dict), i.e., a collection of key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType), and valueContainsNull (a BooleanType). It extends the DataType class, the superclass of all types in PySpark, and takes two mandatory arguments, the key type and the value type (both of type DataType), plus one optional boolean argument, valueContainsNull. The map_values() function returns all the values of a map column, and the map_keys() function returns all of its keys.


System Requirements

  • Python (3.6 or later)
  • Apache Spark (3.1.1 version)

This recipe explains the PySpark MapType and the map_values() and map_keys() functions, and shows how to use them in PySpark.

Implementing the map_values() and map_keys() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, MapType
from pyspark.sql.functions import map_values, map_keys

The SparkSession, StructField, StructType, StringType, MapType, map_values and map_keys packages are imported into the environment to perform the map_values() and map_keys() functions in PySpark.

# Implementing the map_values() and map_keys() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark map_values() and map_keys()').getOrCreate()
Sample_schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])
Sample_dataDictionary = [
    ('Ram', {'hair': 'brown', 'eye': 'brown'}),
    ('Shyam', {'hair': 'black', 'eye': 'black'}),
    ('Raman', {'hair': 'orange', 'eye': 'black'}),
    ('Sonu', {'hair': 'red', 'eye': None}),
    ('Vinay', {'hair': 'black', 'eye': ''})
]
dataframe = spark.createDataFrame(data = Sample_dataDictionary, schema = Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using map_values() function
dataframe.select(dataframe.name, map_values(dataframe.properties)).show()
# Using map_keys() function
dataframe.select(dataframe.name, map_keys(dataframe.properties)).show()

The "dataframe" value is created in which the Sample_dataDictionary and Sample_schema are defined. Using the map_values() PySpark function returns the map values of all the dataframe properties present in the dataframe. The map_keys() PySpark function returns the map keys of all the dataframe properties current in the dataframe.

