Explain the pivot function and stack function in PySpark in Databricks

This recipe explains the pivot function and the stack function in PySpark in Databricks.

Recipe Objective - Explain the pivot() function and stack() function in PySpark in Databricks

In PySpark, the pivot() function is used to rotate, or transpose, data from one column into multiple DataFrame columns; the reverse operation, unpivot, rotates those column values back into rows. pivot() is an aggregation in which the distinct values of one of the grouping columns are transposed into individual columns. A PySpark DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in Python or R. DataFrames in PySpark can be constructed from a wide array of sources, such as structured data files, tables in Apache Hive, external databases, or existing Resilient Distributed Datasets. PySpark SQL does not have a built-in unpivot function, so the stack() function is used instead.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the pivot() and stack() functions are and demonstrates their usage in PySpark.

Implementing the Pivot() function and Stack() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment so that the pivot() and stack() functions can be used in PySpark.


# Implementing the Pivot() function and Stack() function in Databricks in PySpark
spark = SparkSession.builder.appName('Pivot()Stack() PySpark').getOrCreate()
sample_data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"), \
("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]
sample_columns= ["Product","Amount","Country"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using the pivot() function
pivot_DataFrame = dataframe.groupBy("Product").pivot("Country").sum("Amount")
pivot_DataFrame.printSchema()
pivot_DataFrame.show(truncate=False)
pivot_DataFrame2 = dataframe.groupBy("Product","Country") \
.sum("Amount") \
.groupBy("Product") \
.pivot("Country") \
.sum("sum(Amount)")
pivot_DataFrame2.printSchema()
pivot_DataFrame2.show(truncate=False)
# Using stack() function to unpivot (only the Canada, China and Mexico columns)
unpivot_Expr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"
unpivot_DataFrame = pivot_DataFrame2.select("Product", expr(unpivot_Expr)) \
.where("Total is not null")
unpivot_DataFrame.show(truncate=False)


The SparkSession is created, and "dataframe" is defined using sample_data and sample_columns. "pivot_DataFrame" computes the total Amount of each Product exported to each Country: it groups by Product, pivots by Country, and sums Amount. This transposes the countries from DataFrame rows into columns; a product with no amount for a country gets null by default. "pivot_DataFrame2" produces the same result in two phases, pre-aggregating by Product and Country before pivoting, which is generally more efficient. Finally, the stack() function performs the unpivot operation, converting the pivoted country columns back into rows and filtering out the nulls that the pivot introduced.
