Explain map_values() and map_keys() functions in PySpark in Databricks

This recipe explains what the map_values() and map_keys() functions do in PySpark in Databricks

Recipe Objective - Explain the map_values() and map_keys() functions in PySpark in Databricks

The PySpark MapType (also called map type) in Apache Spark is a data type used to represent a Python dictionary (dict), that is, to store key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType) and valueContainsNull (a BooleanType). MapType extends the DataType class, the superclass of all types in PySpark, and takes two mandatory arguments, the key type and the value type (both of type DataType), plus one optional boolean argument, valueContainsNull. The map_values() function returns all the values of a map column, and the map_keys() function returns all of its keys.
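As a quick sketch of the constructor described above (the variable name here is only illustrative), the key type and value type are passed as the two mandatory arguments and valueContainsNull as the optional third:

from pyspark.sql.types import StringType, MapType

# keyType = StringType, valueType = StringType, valueContainsNull = True (illustrative variable name)
map_column_type = MapType(StringType(), StringType(), valueContainsNull=True)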


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains the PySpark MapType and the map_values() and map_keys() functions, and shows how to use them in PySpark.

Implementing the map_values() and map_keys() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, MapType
from pyspark.sql.functions import map_values, map_keys

The SparkSession, StructField, StructType, StringType, MapType, map_values and map_keys classes and functions are imported into the environment so that the map_values() and map_keys() functions can be used in PySpark.

# Implementing the map_values() and map_keys() functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark map_values() and map_keys()').getOrCreate()
Sample_schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])
Sample_dataDictionary = [
    ('Ram', {'hair': 'brown', 'eye': 'brown'}),
    ('Shyam', {'hair': 'black', 'eye': 'black'}),
    ('Raman', {'hair': 'orange', 'eye': 'black'}),
    ('Sonu', {'hair': 'red', 'eye': None}),
    ('Vinay', {'hair': 'black', 'eye': ''})
]
dataframe = spark.createDataFrame(data=Sample_dataDictionary, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using map_values() function
dataframe.select(dataframe.name, map_values(dataframe.properties)).show()
# Using map_keys() function
dataframe.select(dataframe.name, map_keys(dataframe.properties)).show()

The "dataframe" is created from the Sample_dataDictionary data and the Sample_schema. The map_values() PySpark function returns the map values of the "properties" column for every row of the dataframe, and the map_keys() PySpark function returns the corresponding map keys.
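Beyond returning all keys or all values at once, individual entries of a map column can also be pulled out by key with getItem(), and explode() turns each key-value pair into its own row. The following is a minimal sketch assuming the "dataframe" created above; the "hair" key and the column alias come from the sample data and are only illustrative:

# Selecting a single map value by key, then exploding the map into key-value rows
from pyspark.sql.functions import explode
dataframe.select(dataframe.name, dataframe.properties.getItem("hair").alias("hair")).show()
dataframe.select(dataframe.name, explode(dataframe.properties)).show()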

