Explain the orderBy and sort functions in PySpark in Databricks

This recipe explains what the orderBy() and sort() functions do in PySpark in Databricks.

Recipe Objective - Explain the orderBy() and sort() functions in PySpark in Databricks

In PySpark, the DataFrame class provides a sort() function, which sorts on one or more columns and sorts in ascending order by default. The DataFrame also provides an orderBy() function, which likewise sorts on one or more columns and defaults to ascending order. Both sort() and orderBy() are used to sort a DataFrame in ascending or descending order based on a single column or multiple columns.

In Apache Spark, Resilient Distributed Dataset (RDD) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD rather than updating an existing one, which results in an RDD lineage. RDD lineage is also called the RDD operator graph or RDD dependency graph. Transformations are lazy operations: none of them execute until an action is called by the user.


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the orderBy() and sort() functions are and demonstrates their usage in PySpark.

Implementing the orderBy() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, asc, desc

SparkSession, Row, col, asc, and desc are imported into the environment to use the orderBy() and sort() functions in PySpark.

# Implementing the orderBy() and sort() functions in Databricks in PySpark
spark = SparkSession.builder.appName('orderby() and sort() PySpark').getOrCreate()
sample_data = [("Ram","Sales","DL",80000,24,90000), \
("Shyam","Sales","DL",76000,46,10000), \
("Amit","Sales","RJ",71000,20,13000), \
("Pooja","Finance","RJ",80000,14,13000), \
("Raman","Finance","RJ",89000,30,14000), \
("Anoop","Finance","DL",73000,46,29000), \
("Rahul","Finance","DL",89000,63,25000), \
("Raju","Marketing","RJ",90000,35,28000), \
("Pappu","Marketing","DL",81000,40,11000) \
]
sample_columns= ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using sort() function
dataframe.sort("department","state").show(truncate=False)
dataframe.sort(col("department"),col("state")).show(truncate=False)
# Using orderBy() function
dataframe.orderBy("department","state").show(truncate=False)
dataframe.orderBy(col("department"),col("state")).show(truncate=False)
# Using sort() function by Ascending
dataframe.sort(dataframe.department.asc(), dataframe.state.asc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").asc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").asc()).show(truncate=False)
# Using sort() function by Descending
dataframe.sort(dataframe.department.asc(), dataframe.state.desc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").desc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").desc()).show(truncate=False)

The Spark session is created, and "sample_data" and "sample_columns" are defined. The DataFrame "dataframe" is then created from the sample data and sample columns. With the sort() function, the first statement takes the DataFrame column names as strings and the next takes columns of Column type; the output table is sorted first by the department column and then by the state column. The orderBy() function works the same way: the first statement takes the column names as strings, the next takes columns of Column type, and the output is again sorted by department and then state. Finally, the asc() and desc() methods of the Column class are used with sort() and orderBy() to sort explicitly in ascending and descending order.

