Explain the distinct function and dropDuplicates function in PySpark in Databricks

This recipe explains the distinct() function and the dropDuplicates() function in PySpark in Databricks.

Recipe Objective - Explain the distinct() and dropDuplicates() functions in PySpark in Databricks

In PySpark, the distinct() function is widely used to remove duplicate rows, comparing all columns of the DataFrame. The dropDuplicates() function is widely used to drop duplicate rows based on one or more selected columns. Apache PySpark Resilient Distributed Dataset (RDD) transformations are Spark operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD instead of updating an existing one, which results in an RDD lineage. RDD lineage is defined as the RDD operator graph or RDD dependency graph. RDD transformations are also lazy operations: none of the transformations are executed until an action is called by the user.


System Requirements

  • Python 3.0
  • Apache Spark 3.1.1

This recipe explains what the distinct() and dropDuplicates() functions are and demonstrates their usage in PySpark.

Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment to use the distinct() and dropDuplicates() functions in PySpark.

# Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark
spark = SparkSession.builder.appName('distinct() and dropDuplicates() PySpark').getOrCreate()
sample_data = [("Ram", "Sales", 4000),
               ("Shyam", "Sales", 5600),
               ("Amit", "Sales", 5100),
               ("Rahul", "Finance", 4000),
               ("Raju", "Sales", 4000),
               ("Ramu", "Finance", 4300),
               ("Shamu", "Finance", 4900),
               ("Kaushik", "Marketing", 4000),
               ("Sagar", "Marketing", 3000),
               ("Prakash", "Sales", 3100)]
sample_columns = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
#Using Distinct on Dataframe
distinct_DataFrame = dataframe.distinct()
print("Distinct count: "+str(distinct_DataFrame.count()))
distinct_DataFrame.show(truncate=False)
# Using dropDuplicates() function
dataframe2 = dataframe.dropDuplicates()
print("Distinct count: "+str(dataframe2.count()))
dataframe2.show(truncate=False)
#Drop duplicates on selected columns
dropDis_Dataframe = dataframe.dropDuplicates(["department","salary"])
print("Distinct count of the department salary : "+str(dropDis_Dataframe.count()))
dropDis_Dataframe.show(truncate=False)

The Spark session is defined first, followed by "sample_data" and "sample_columns". The DataFrame "dataframe" is then created from the sample data and sample columns. The distinct() function on the DataFrame returns a new DataFrame with duplicate records removed. The dropDuplicates() function is used to create "dataframe2", and the output is displayed using the show() function. Finally, dropDuplicates() is executed on the selected columns "department" and "salary", removing rows that share the same values in both of those columns.
