Explain the pivot function and stack function in PySpark in Databricks

This recipe explains what the pivot() and stack() functions are in PySpark in Databricks and how to use them.

Recipe Objective - Explain the pivot() function and stack() function in PySpark in Databricks

In PySpark, the pivot() function rotates (transposes) data from the rows of one column into multiple DataFrame columns; the reverse operation, unpivot, rotates those column values back into rows. pivot() is an aggregation in which the distinct values of one grouping column are transposed into individual columns. A PySpark DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in Python or R, and can be constructed from a wide array of sources such as structured data files, tables in Apache Hive, external databases, or existing Resilient Distributed Datasets. PySpark SQL does not provide a built-in unpivot function, so the stack() function is used instead.
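A quick illustration of stack() itself helps here: the SQL generator stack(n, expr1, ..., exprk) splits its k arguments into n rows, which is what makes it usable as an unpivot. A minimal sketch, assuming an active SparkSession named spark:

# stack(2, ...) splits the four values below into 2 rows of 2 columns
spark.sql("SELECT stack(2, 'a', 1, 'b', 2)").show()
# +----+----+
# |col0|col1|
# +----+----+
# |   a|   1|
# |   b|   2|
# +----+----+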


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the pivot() and stack() functions are and demonstrates how to use them in PySpark.

Implementing the Pivot() function and Stack() function in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment so the pivot() and stack() functions can be used in PySpark.


# Implementing the pivot() and stack() functions in Databricks in PySpark
spark = SparkSession.builder.appName('Pivot()Stack() PySpark').getOrCreate()
# Sample export data: (Product, Amount, Country)
sample_data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"),
               ("Orange",2000,"USA"), ("Orange",2000,"USA"), ("Banana",400,"China"),
               ("Carrots",1200,"China"), ("Beans",1500,"China"), ("Orange",4000,"China"),
               ("Banana",2000,"Canada"), ("Carrots",2000,"Canada"), ("Beans",2000,"Mexico")]
sample_columns = ["Product", "Amount", "Country"]
dataframe = spark.createDataFrame(data=sample_data, schema=sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using the pivot() function: group by Product, transpose the distinct
# Country values into columns, and sum Amount for each (Product, Country) cell
pivot_DataFrame = dataframe.groupBy("Product").pivot("Country").sum("Amount")
pivot_DataFrame.printSchema()
pivot_DataFrame.show(truncate=False)
# Two-phase variant: pre-aggregate by (Product, Country) before pivoting,
# which is generally more efficient than pivoting the raw rows directly
pivot_DataFrame2 = dataframe.groupBy("Product", "Country") \
    .sum("Amount") \
    .groupBy("Product") \
    .pivot("Country") \
    .sum("sum(Amount)")
pivot_DataFrame2.printSchema()
pivot_DataFrame2.show(truncate=False)
# Using the stack() function to unpivot: fold the Canada, China and Mexico
# columns back into (Country, Total) rows, dropping empty (null) cells
unpivot_Expr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"
unpivot_DataFrame = pivot_DataFrame2.select("Product", expr(unpivot_Expr)) \
    .where("Total is not null")
unpivot_DataFrame.show(truncate=False)
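
One practical note: when the pivot column has many distinct values, Spark must first scan the data to discover them. pivot() also accepts an optional second argument listing the values to pivot on, which skips that discovery pass and fixes the output columns. A sketch against the same dataframe (the countries list simply enumerates the values known from the sample data):

# Supplying the pivot values up front avoids the extra distinct-values scan
countries = ["USA", "China", "Canada", "Mexico"]
pivot_known = dataframe.groupBy("Product").pivot("Country", countries).sum("Amount")
pivot_known.show(truncate=False)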

The SparkSession is created first, and "dataframe" is built from sample_data and sample_columns. "pivot_DataFrame" then computes the total amount of each product exported to each country: it groups by Product, pivots by Country, and sums Amount, transposing the countries from DataFrame rows into columns; cells with no matching rows are null by default. Finally, stack() performs the unpivot operation, converting the pivoted country columns back into (Country, Total) rows.
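
For reference, the unpivoted result produced by the code above contains only the non-null Canada, China and Mexico totals (USA is deliberately left out of the stack() expression); row order may vary:

# +-------+-------+-----+
# |Product|Country|Total|
# +-------+-------+-----+
# |Orange |China  |4000 |
# |Beans  |China  |1500 |
# |Beans  |Mexico |2000 |
# |Banana |Canada |2000 |
# |Banana |China  |400  |
# |Carrots|Canada |2000 |
# |Carrots|China  |1200 |
# +-------+-------+-----+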
