Explain the orderBy and sort functions in PySpark in Databricks

This recipe explains what the orderBy() and sort() functions do in PySpark in Databricks.

Recipe Objective - Explain the orderBy() and sort() functions in PySpark in Databricks

In PySpark, the DataFrame class provides a sort() function, which sorts on one or more columns and sorts in ascending order by default. The DataFrame also provides an orderBy() function, which likewise sorts on one or more columns and defaults to ascending order. Both sort() and orderBy() are used to sort a DataFrame in ascending or descending order based on a single column or multiple columns.

In Apache Spark, Resilient Distributed Dataset (RDD) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD rather than updating an existing one, which results in an RDD lineage. RDD lineage is also called the RDD operator graph or RDD dependency graph. Transformations are lazy operations: none of them execute until an action is called by the user.


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the orderBy() and sort() functions are and demonstrates their usage in PySpark.

Implementing the orderBy() and sort() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, asc, desc

SparkSession, Row, col, asc, and desc are imported into the environment to use the orderBy() and sort() functions in PySpark.

# Implementing the orderBy() and sort() functions in Databricks in PySpark
spark = SparkSession.builder.appName('orderby() and sort() PySpark').getOrCreate()
sample_data = [("Ram","Sales","DL",80000,24,90000), \
("Shyam","Sales","DL",76000,46,10000), \
("Amit","Sales","RJ",71000,20,13000), \
("Pooja","Finance","RJ",80000,14,13000), \
("Raman","Finance","RJ",89000,30,14000), \
("Anoop","Finance","DL",73000,46,29000), \
("Rahul","Finance","DL",89000,63,25000), \
("Raju","Marketing","RJ",90000,35,28000), \
("Pappu","Marketing","DL",81000,40,11000) \
]
sample_columns= ["employee_name","department","state","salary","age","bonus"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
# Using sort() function
dataframe.sort("department","state").show(truncate=False)
dataframe.sort(col("department"),col("state")).show(truncate=False)
# Using orderBy() function
dataframe.orderBy("department","state").show(truncate=False)
dataframe.orderBy(col("department"),col("state")).show(truncate=False)
# Using sort() function by Ascending
dataframe.sort(dataframe.department.asc(), dataframe.state.asc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").asc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").asc()).show(truncate=False)
# Using sort() function by Descending
dataframe.sort(dataframe.department.asc(), dataframe.state.desc()).show(truncate=False)
dataframe.sort(col("department").asc(),col("state").desc()).show(truncate=False)
dataframe.orderBy(col("department").asc(),col("state").desc()).show(truncate=False)

The Spark session is created, and "sample_data" and "sample_columns" are defined. The DataFrame "dataframe" is then created from the sample data and sample columns. With the sort() function, the first statement takes the DataFrame column names as strings and the next takes columns of Column type; the output table is sorted first by the department column and then by the state column. The orderBy() function works the same way: the first statement takes the column names as strings, the next takes columns of Column type, and the output is again sorted by department and then state. Finally, the asc() and desc() methods of the Column class are used with sort() and orderBy() to sort explicitly in ascending and descending order.

