Explain rank and row_number window functions in PySpark

This tutorial gives a detailed explanation of the rank and row_number window functions in PySpark in Databricks, and shows how these functions are used in day-to-day operations in Python.

Recipe Objective - Explain rank and row_number window functions in PySpark in Databricks

The row_number() and rank() functions in PySpark are popularly used in day-to-day operations and make difficult tasks easy. The rank() function assigns a rank to each row within a window partition, and it leaves gaps in the ranking when there are ties. The row_number() function assigns a sequential row number, starting from 1, to each row of a window partition.
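Before turning to Spark itself, the difference between the two functions can be sketched in plain Python over one already-sorted window partition (the salary values here are hypothetical, chosen only to include a tie):

```python
# Plain-Python sketch of row_number() vs rank() semantics over a single
# sorted window partition (hypothetical salary values, for illustration only).
salaries = [3000, 4000, 4000, 5100]  # already ordered ascending, contains a tie

row_numbers = []  # sequential: every row gets the next number
ranks = []        # tied rows share a rank; the next distinct value leaves a gap
for i, salary in enumerate(salaries):
    row_numbers.append(i + 1)
    if i > 0 and salary == salaries[i - 1]:
        ranks.append(ranks[-1])   # tie: repeat the previous rank
    else:
        ranks.append(i + 1)       # new value: rank jumps to the row position

print(row_numbers)  # [1, 2, 3, 4]
print(ranks)        # [1, 2, 2, 4] -- rank 3 is skipped because of the tie
```

The gap at rank 3 is exactly what distinguishes rank() from row_number(): after a tie, rank() resumes at the row position rather than at the next integer.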


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the rank and row_number window functions are and how to use them in PySpark.


Implementing the rank and row_number window functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import row_number

The SparkSession, Window, rank, and row_number packages are imported into the environment to demonstrate the rank and row_number window functions in PySpark.

# Implementing the rank and row_number window functions in Databricks in PySpark
spark = SparkSession.builder.appName('Spark rank() row_number()').getOrCreate()
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Defining the row_number() window function
Window_Spec = Window.partitionBy("department").orderBy("salary")
dataframe.withColumn("row_number", row_number().over(Window_Spec)) \
    .show(truncate=False)
# Defining the rank() window function
dataframe.withColumn("rank", rank().over(Window_Spec)) \
    .show()

The "dataframe" value is created with the Sample_data and Sample_columns defined above. The row_number() function returns a sequential row number, starting from 1, for each row of a window partition. The rank() function returns the rank of each row within its window partition, and it leaves gaps in the ranking when there are ties.
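To make the gap behavior concrete, consider just the Technology partition of Sample_data. The rank sequence that rank().over(Window_Spec) produces for it can be reproduced with a small plain-Python check (the salary list below is copied from the Technology rows of Sample_data; rank = 1 + number of strictly smaller values in the partition):

```python
# Technology salaries from Sample_data, sorted ascending as orderBy("salary") does
tech_salaries = sorted([4000, 5600, 5100, 4000, 5100])
# -> [4000, 4000, 5100, 5100, 5600]

# rank(): 1 + count of strictly smaller values in the partition,
# so tied rows share a rank and the next distinct value skips positions
ranks = [1 + sum(s < salary for s in tech_salaries) for salary in tech_salaries]
print(ranks)  # [1, 1, 3, 3, 5] -- gaps at 2 and 4

# row_number() simply numbers the ordered rows 1..5 with no gaps
row_numbers = list(range(1, len(tech_salaries) + 1))
print(row_numbers)  # [1, 2, 3, 4, 5]
```

The two employees tied at 4000 both get rank 1 and the two tied at 5100 both get rank 3, so ranks 2 and 4 never appear, while row_number assigns 1 through 5 without gaps.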

