Explain dense_rank and percent_rank window functions in PySpark

This tutorial gives an overview of the dense_rank and percent_rank window functions in PySpark on Databricks, explains the difference between the two functions, and shows how to implement them in Python.

Recipe Objective - Explain the dense_rank and percent_rank window functions in PySpark in Databricks

The dense_rank() and percent_rank() functions in PySpark are popular in day-to-day operations because they make otherwise difficult tasks easy. The dense_rank() window function returns the rank of each row within a window partition without any gaps; it is similar to the rank() function, the only difference being that rank() leaves gaps in the ranking when there are ties. The percent_rank() window function returns the relative rank of each row, expressed as a percentage, within the specified window.
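
To make the difference concrete, here is a minimal sketch (separate from the recipe's own code, with illustrative data, column names and application name) that applies rank(), dense_rank() and percent_rank() over the same ordered window on a toy DataFrame containing one tied value.

# Minimal sketch: rank() vs dense_rank() vs percent_rank() over a column with a tie
# (illustrative data and names, assuming a local SparkSession can be created)
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, percent_rank

spark = SparkSession.builder.appName("rank_functions_demo").getOrCreate()

scores = spark.createDataFrame(
    [("a", 10), ("b", 20), ("c", 20), ("d", 30)],
    ["id", "score"])

# Ordering the whole DataFrame as a single window (no partitioning) keeps the tie visible
window_spec = Window.orderBy("score")

scores.select(
    "id",
    "score",
    rank().over(window_spec).alias("rank"),                  # 1, 2, 2, 4 -> gap after the tie
    dense_rank().over(window_spec).alias("dense_rank"),      # 1, 2, 2, 3 -> no gap
    percent_rank().over(window_spec).alias("percent_rank")   # (rank - 1) / (row count - 1)
).show()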


This recipe explains what the dense_rank and percent_rank window functions are and how to use them in PySpark.

Implementing the dense_rank and percent_rank window functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, percent_rank

The SparkSession, Window, dense_rank and percent_rank imports are brought into the environment to demonstrate the dense_rank and percent_rank window functions in PySpark.

# Implementing the dense_rank and percent_rank window functions in Databricks in PySpark
spark = SparkSession.builder.appName('Spark dense_rank() percent_rank()').getOrCreate()

# Sample employee data: (employee_name, department, salary)
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)]
Sample_columns = ["employee_name", "department", "salary"]

dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)

# Window specification: partition by department, order rows within each partition by salary
Window_Spec = Window.partitionBy("department").orderBy("salary")

# Using the dense_rank() function
dataframe.withColumn("dense_rank", dense_rank().over(Window_Spec)) \
    .show()

# Using the percent_rank() function
dataframe.withColumn("percent_rank", percent_rank().over(Window_Spec)) \
    .show()

The "dataframe" value is created in which the Sample_data and Sample_columns are defined. The dense_rank() function returns the result with the rank of rows within a window partition that is the "Window_Spec" without any gaps. The percent_rank() function in PySpark is defined as returning the rank of rows in the percentage format "Window_Spec."

