Explain the pivot function and stack function in PySpark in Databricks

This recipe explains what the pivot() and stack() functions are in PySpark in Databricks and how to use them.

Recipe Objective - Explain the pivot() function and stack() function in PySpark in Databricks

In PySpark, the pivot() function rotates (transposes) data from the rows of one column into multiple DataFrame columns; the reverse operation, unpivot, rotates those column values back into rows. pivot() is an aggregation in which the distinct values of one grouping column are transposed into individual columns. A PySpark DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in Python or R, and can be constructed from a wide array of sources such as structured data files, tables in Apache Hive, external databases, or existing Resilient Distributed Datasets. PySpark SQL does not provide a built-in unpivot function, so the stack() function is used instead.
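A quick illustration of stack() itself helps here: the SQL generator stack(n, expr1, ..., exprk) splits its k arguments into n rows, which is what makes it usable as an unpivot. A minimal sketch, assuming an active SparkSession named spark:

# stack(2, ...) splits the four values below into 2 rows of 2 columns
spark.sql("SELECT stack(2, 'a', 1, 'b', 2)").show()
# +----+----+
# |col0|col1|
# +----+----+
# |   a|   1|
# |   b|   2|
# +----+----+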


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the pivot() and stack() functions are and demonstrates how to use them in PySpark.

Implementing the Pivot() function and Stack() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment so the pivot() and stack() functions can be used in PySpark.


# Implementing the pivot() and stack() functions in Databricks in PySpark
spark = SparkSession.builder.appName('Pivot()Stack() PySpark').getOrCreate()
# Sample export data: (Product, Amount, Country)
sample_data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"),
               ("Orange",2000,"USA"), ("Orange",2000,"USA"), ("Banana",400,"China"),
               ("Carrots",1200,"China"), ("Beans",1500,"China"), ("Orange",4000,"China"),
               ("Banana",2000,"Canada"), ("Carrots",2000,"Canada"), ("Beans",2000,"Mexico")]
sample_columns = ["Product", "Amount", "Country"]
dataframe = spark.createDataFrame(data=sample_data, schema=sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using the pivot() function: group by Product, transpose the distinct
# Country values into columns, and sum Amount for each (Product, Country) cell
pivot_DataFrame = dataframe.groupBy("Product").pivot("Country").sum("Amount")
pivot_DataFrame.printSchema()
pivot_DataFrame.show(truncate=False)
# Two-phase variant: pre-aggregate by (Product, Country) before pivoting,
# which is generally more efficient than pivoting the raw rows directly
pivot_DataFrame2 = dataframe.groupBy("Product", "Country") \
    .sum("Amount") \
    .groupBy("Product") \
    .pivot("Country") \
    .sum("sum(Amount)")
pivot_DataFrame2.printSchema()
pivot_DataFrame2.show(truncate=False)
# Using the stack() function to unpivot: fold the Canada, China and Mexico
# columns back into (Country, Total) rows, dropping empty (null) cells
unpivot_Expr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"
unpivot_DataFrame = pivot_DataFrame2.select("Product", expr(unpivot_Expr)) \
    .where("Total is not null")
unpivot_DataFrame.show(truncate=False)
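
One practical note: when the pivot column has many distinct values, Spark must first scan the data to discover them. pivot() also accepts an optional second argument listing the values to pivot on, which skips that discovery pass and fixes the output columns. A sketch against the same dataframe (the countries list simply enumerates the values known from the sample data):

# Supplying the pivot values up front avoids the extra distinct-values scan
countries = ["USA", "China", "Canada", "Mexico"]
pivot_known = dataframe.groupBy("Product").pivot("Country", countries).sum("Amount")
pivot_known.show(truncate=False)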

The SparkSession is created first, and "dataframe" is built from sample_data and sample_columns. "pivot_DataFrame" then computes the total amount of each product exported to each country: it groups by Product, pivots by Country, and sums Amount, transposing the countries from DataFrame rows into columns; cells with no matching rows are null by default. Finally, stack() performs the unpivot operation, converting the pivoted country columns back into (Country, Total) rows.
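
For reference, the unpivoted result produced by the code above contains only the non-null Canada, China and Mexico totals (USA is deliberately left out of the stack() expression); row order may vary:

# +-------+-------+-----+
# |Product|Country|Total|
# +-------+-------+-----+
# |Orange |China  |4000 |
# |Beans  |China  |1500 |
# |Beans  |Mexico |2000 |
# |Banana |Canada |2000 |
# |Banana |China  |400  |
# |Carrots|Canada |2000 |
# |Carrots|China  |1200 |
# +-------+-------+-----+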
