Explain rank and row_number window functions in PySpark

This tutorial gives a detailed explanation of the rank() and row_number() window functions in PySpark on Databricks, and shows how these functions are used in day-to-day operations in Python.

Recipe Objective - Explain the rank and row_number window functions in PySpark in Databricks

The row_number() and rank() functions in PySpark are popular for day-to-day operations and make otherwise difficult tasks easy. The rank() function assigns a rank to each row within its window partition, leaving gaps in the ranking when there are ties. The row_number() function assigns a sequential row number, starting from 1, to each row of a window partition.
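The difference between the two rules is easiest to see on a tiny sorted partition. Below is a plain-Python sketch of the semantics (illustrative helper functions, not part of the PySpark API):

```python
def row_numbers(sorted_values):
    # row_number(): strictly sequential, ignores ties
    return list(range(1, len(sorted_values) + 1))

def ranks_with_gaps(sorted_values):
    # rank(): tied values share a rank; the next distinct value
    # takes its row position, leaving a gap in the ranking
    ranks = []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(i + 1)       # new value: rank = row position
    return ranks

salaries = [3000, 4000, 4000, 5000]
print(row_numbers(salaries))      # [1, 2, 3, 4]
print(ranks_with_gaps(salaries))  # [1, 2, 2, 4]  <- rank 3 is skipped
```

The two tied salaries of 4000 get distinct row numbers (2 and 3) but the same rank (2), and the next distinct salary jumps to rank 4.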


System Requirements

  • Python (version 3.x)
  • Apache Spark (version 3.1.1)

This recipe explains what the rank and row_number window functions are and how to apply them in PySpark.


Implementing the rank and row_number window functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import row_number

The SparkSession, Window, rank and row_number packages are imported into the environment to demonstrate the rank and row_number window functions in PySpark.

# Implementing the rank and row_number window functions in Databricks in PySpark
spark = SparkSession.builder.appName('Spark rank() row_number()').getOrCreate()
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Defining the window: partition by department, order rows by salary
Window_Spec = Window.partitionBy("department").orderBy("salary")
# Applying the row_number() function
dataframe.withColumn("row_number", row_number().over(Window_Spec)) \
    .show(truncate=False)
# Applying the rank() function
dataframe.withColumn("rank", rank().over(Window_Spec)) \
    .show()

The "dataframe" value is created from the Sample_data and Sample_columns defined above. The row_number() function returns the sequential row number, starting from 1, for each row of a window partition. The rank() function returns the rank of each row within its window partition, and it leaves gaps in the ranking when there are ties.
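As a check on the tie behaviour, the numbering can be traced by hand over the Technology partition of Sample_data, with salaries ordered ascending. This is a plain-Python walk-through of the expected output, not PySpark code:

```python
# Technology-department salaries from Sample_data, ordered by salary
# (Ram, Ram, Veer, Anas, Shyam)
tech_salaries = [4000, 4000, 5100, 5100, 5600]

row_number_col = []
rank_col = []
for i, salary in enumerate(tech_salaries):
    row_number_col.append(i + 1)           # always sequential
    if i > 0 and salary == tech_salaries[i - 1]:
        rank_col.append(rank_col[-1])      # tie keeps the previous rank
    else:
        rank_col.append(i + 1)             # otherwise rank = row position

print(row_number_col)  # [1, 2, 3, 4, 5]
print(rank_col)        # [1, 1, 3, 3, 5] -> ranks 2 and 4 are skipped
```

This matches what the two show() calls above display for the Technology partition: row_number is always 1 through 5, while rank repeats for the tied salaries (4000 and 5100) and skips ranks 2 and 4.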

