Explain the distinct function and dropDuplicates function in PySpark in Databricks

This recipe explains what the distinct() and dropDuplicates() functions are in PySpark in Databricks.

Recipe Objective - Explain the distinct() and dropDuplicates() functions in PySpark in Databricks

In PySpark, the distinct() function is widely used to remove duplicate rows from a DataFrame, comparing all columns. The dropDuplicates() function is widely used to drop duplicate rows based on selected (one or more) columns. Apache Spark Resilient Distributed Dataset (RDD) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD rather than updating an existing one, which builds up an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD dependency graph. Transformations are lazy operations: none of them execute until an action is called by the user.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the distinct() and dropDuplicates() functions are and demonstrates their usage in PySpark.

Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment to use the distinct() and dropDuplicates() functions in PySpark.

# Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark
spark = SparkSession.builder.appName('distinct() and dropDuplicates() PySpark').getOrCreate()
sample_data = [("Ram", "Sales", 4000),
               ("Shyam", "Sales", 5600),
               ("Amit", "Sales", 5100),
               ("Rahul", "Finance", 4000),
               ("Raju", "Sales", 4000),
               ("Ramu", "Finance", 4300),
               ("Shamu", "Finance", 4900),
               ("Kaushik", "Marketing", 4000),
               ("Sagar", "Marketing", 3000),
               ("Prakash", "Sales", 3100)]
sample_columns= ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
#Using Distinct on Dataframe
distinct_DataFrame = dataframe.distinct()
print("Distinct count: "+str(distinct_DataFrame.count()))
distinct_DataFrame.show(truncate=False)
# Using dropDuplicates() function
dataframe2 = dataframe.dropDuplicates()
print("Distinct count: "+str(dataframe2.count()))
dataframe2.show(truncate=False)
#Drop duplicates on selected columns
dropDis_Dataframe = dataframe.dropDuplicates(["department","salary"])
print("Distinct count of the department salary : "+str(dropDis_Dataframe.count()))
dropDis_Dataframe.show(truncate=False)

The Spark session is defined first, followed by "sample_data" and "sample_columns". The DataFrame "dataframe" is then created from the sample data and columns. The distinct() function returns a new DataFrame with duplicate records removed. The dropDuplicates() function, called with no arguments, creates "dataframe2" the same way, and the output is displayed using the show() function. Finally, dropDuplicates() is executed on the selected columns "department" and "salary", keeping one row per unique combination of those two columns.
