Explain the orderBy and sort functions in PySpark in Databricks

This recipe explains the orderBy() and sort() functions in PySpark in Databricks.

Recipe Objective - Explain the orderBy() and sort() functions in PySpark in Databricks

In PySpark, the DataFrame class provides a sort() function, which sorts on one or more columns in ascending order by default. The DataFrame also provides an orderBy() function, which likewise orders by ascending by default. Both sort() and orderBy() are used to sort the DataFrame in ascending or descending order based on a single column or multiple columns.

In Apache Spark, RDD (Resilient Distributed Dataset) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation never updates an existing RDD; it always creates a new one, which results in an RDD lineage. The RDD lineage is also known as the RDD operator graph or the RDD dependency graph. RDD transformations are lazy operations: none of them execute until an action is called by the user.


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the orderBy() and sort() functions are and demonstrates their usage in PySpark.

Implementing the orderBy() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, asc, desc

The SparkSession, Row, col, asc and desc are imported into the environment to use the orderBy() and sort() functions in PySpark.

# Implementing the orderBy() and sort() functions in Databricks in PySpark
spark = SparkSession.builder.appName('orderby() and sort() PySpark').getOrCreate()
sample_data = [("Ram","Sales","DL",80000,24,90000), \
("Shyam","Sales","DL",76000,46,10000), \
("Amit","Sales","RJ",71000,20,13000), \
("Pooja","Finance","RJ",80000,14,13000), \
("Raman","Finance","RJ",89000,30,14000), \
("Anoop","Finance","DL",73000,46,29000), \
("Rahul","Finance","DL",89000,63,25000), \
("Raju","Marketing","RJ",90000,35,28000), \
("Pappu","Marketing","DL",81000,40,11000) \
]
sample_columns= ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using sort() function
dataframe.sort("department","state").show(truncate=False)
dataframe.sort(col("department"),col("state")).show(truncate=False)
# Using orderBy() function
dataframe.orderBy("department","state").show(truncate=False)
dataframe.orderBy(col("department"),col("state")).show(truncate=False)
# Using sort() function by Ascending
dataframe.sort(dataframe.department.asc(), dataframe.state.asc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").asc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").asc()).show(truncate=False)
# Using sort() function by Descending
dataframe.sort(dataframe.department.asc(), dataframe.state.desc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").desc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").desc()).show(truncate=False)

The Spark session is created, and "sample_data" and "sample_columns" are defined. The DataFrame "dataframe" is then created from the sample data and sample columns. Using the sort() function, the first statement takes the DataFrame column names as strings and the next takes columns as Column objects; the output table is sorted first by the department column and then by the state column. Using the orderBy() function, the first statement likewise takes the column names as strings and the next takes columns as Column objects, producing the same ordering. Finally, the asc() and desc() methods of the Column class are used with sort() and orderBy() to sort explicitly in ascending and descending order.
