Explain dense rank and percent rank window functions in PySpark

This tutorial gives an overview of the dense_rank and percent_rank window functions in PySpark on Databricks, explains the difference between the two functions, and shows how to implement them in Python.

Recipe Objective - Explain the dense_rank and percent_rank window functions in PySpark in Databricks

The dense_rank() and percent_rank() functions in PySpark are popular in day-to-day operations because they make otherwise difficult ranking tasks easy. The dense_rank() window function returns the rank of each row within a window partition without any gaps. It is similar to the rank() function, the difference being that rank() leaves gaps in the ranking when there are ties, while dense_rank() does not. The percent_rank() function returns the relative rank of each row as a fraction between 0 and 1 within the specified window.

System Requirements

This recipe explains what the dense_rank and percent_rank window functions are and how to use them in PySpark.

Implementing the dense_rank and percent_rank window functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
from pyspark.sql.functions import percent_rank

The SparkSession, Window, dense_rank, and percent_rank packages are imported into the environment to demonstrate the dense_rank and percent_rank window functions in PySpark.

# Implementing the dense_rank and percent_rank window functions in Databricks in PySpark
spark = SparkSession.builder.appName('Spark dense_rank() percent_rank()').getOrCreate()
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns= ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = Sample_data, schema = Sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Defining the window: partition by department, order by salary
Window_Spec = Window.partitionBy("department").orderBy("salary")
# Using the dense_rank() function
dataframe.withColumn("dense_rank", dense_rank().over(Window_Spec)) \
    .show()
# Using the percent_rank() function
dataframe.withColumn("percent_rank", percent_rank().over(Window_Spec)) \
    .show()

The "dataframe" is created from the defined Sample_data and Sample_columns. The dense_rank() function returns the rank of each row within the window partition defined by "Window_Spec" without any gaps. The percent_rank() function returns the relative rank of each row as a fraction between 0 and 1 within the same "Window_Spec".
