Explain the distinct function and dropDuplicates function in PySpark in Databricks

This recipe explains what the distinct() and dropDuplicates() functions are in PySpark in Databricks.

Recipe Objective - Explain the distinct() and dropDuplicates() functions in PySpark in Databricks

In PySpark, the distinct() function is widely used to remove duplicate rows from a DataFrame, comparing all columns. The dropDuplicates() function is widely used to drop duplicate rows based on selected (one or more) columns. Apache Spark Resilient Distributed Dataset (RDD) transformations are operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD rather than updating an existing one, which builds up an RDD lineage. RDD lineage is also known as the RDD operator graph or RDD dependency graph. Transformations are lazy operations: none of them execute until an action is called by the user.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the distinct() and dropDuplicates() functions are and demonstrates their usage in PySpark.

Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment to use the distinct() and dropDuplicates() functions in PySpark.

# Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark
spark = SparkSession.builder.appName('distinct() and dropDuplicates() PySpark').getOrCreate()
sample_data = [("Ram", "Sales", 4000),
               ("Shyam", "Sales", 5600),
               ("Amit", "Sales", 5100),
               ("Rahul", "Finance", 4000),
               ("Raju", "Sales", 4000),
               ("Ramu", "Finance", 4300),
               ("Shamu", "Finance", 4900),
               ("Kaushik", "Marketing", 4000),
               ("Sagar", "Marketing", 3000),
               ("Prakash", "Sales", 3100)]
sample_columns= ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
#Using Distinct on Dataframe
distinct_DataFrame = dataframe.distinct()
print("Distinct count: "+str(distinct_DataFrame.count()))
distinct_DataFrame.show(truncate=False)
# Using dropDuplicates() function
dataframe2 = dataframe.dropDuplicates()
print("Distinct count: "+str(dataframe2.count()))
dataframe2.show(truncate=False)
#Drop duplicates on selected columns
dropDis_Dataframe = dataframe.dropDuplicates(["department","salary"])
print("Distinct count of the department salary : "+str(dropDis_Dataframe.count()))
dropDis_Dataframe.show(truncate=False)

The Spark session is defined first, followed by "sample_data" and "sample_columns". The DataFrame "dataframe" is then created from the sample data and columns. The distinct() function returns a new DataFrame with duplicate records removed. The dropDuplicates() function, called with no arguments, creates "dataframe2" the same way, and the output is displayed using the show() function. Finally, dropDuplicates() is executed on the selected columns "department" and "salary", keeping one row per unique combination of those two columns.
