Explain the distinct function and dropDuplicates function in PySpark in Databricks

This recipe explains the distinct() function and the dropDuplicates() function in PySpark in Databricks.

Recipe Objective - Explain the distinct() and dropDuplicates() functions in PySpark in Databricks

In PySpark, the distinct() function is widely used to remove duplicate rows, comparing all columns of the DataFrame. The dropDuplicates() function is widely used to drop duplicate rows based on one or more selected columns. Apache PySpark Resilient Distributed Dataset (RDD) transformations are Spark operations that, when executed on an RDD, produce one or more new RDDs. Because RDDs are immutable, a transformation always creates a new RDD instead of updating an existing one, which results in an RDD lineage. RDD lineage is defined as the RDD operator graph or RDD dependency graph. RDD transformations are also lazy operations: none of the transformations are executed until an action is called by the user.


System Requirements

  • Python 3.0
  • Apache Spark 3.1.1

This recipe explains what the distinct() and dropDuplicates() functions are and demonstrates their usage in PySpark.

Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

SparkSession and expr are imported into the environment to use the distinct() and dropDuplicates() functions in PySpark.

# Implementing the distinct() and dropDuplicates() functions in Databricks in PySpark
spark = SparkSession.builder.appName('distinct() and dropDuplicates() PySpark').getOrCreate()
sample_data = [("Ram", "Sales", 4000),
               ("Shyam", "Sales", 5600),
               ("Amit", "Sales", 5100),
               ("Rahul", "Finance", 4000),
               ("Raju", "Sales", 4000),
               ("Ramu", "Finance", 4300),
               ("Shamu", "Finance", 4900),
               ("Kaushik", "Marketing", 4000),
               ("Sagar", "Marketing", 3000),
               ("Prakash", "Sales", 3100)]
sample_columns = ["employee_name", "department", "salary"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.printSchema()
dataframe.show(truncate=False)
#Using Distinct on Dataframe
distinct_DataFrame = dataframe.distinct()
print("Distinct count: "+str(distinct_DataFrame.count()))
distinct_DataFrame.show(truncate=False)
# Using dropDuplicates() function
dataframe2 = dataframe.dropDuplicates()
print("Distinct count: "+str(dataframe2.count()))
dataframe2.show(truncate=False)
#Drop duplicates on selected columns
dropDis_Dataframe = dataframe.dropDuplicates(["department","salary"])
print("Distinct count of the department salary : "+str(dropDis_Dataframe.count()))
dropDis_Dataframe.show(truncate=False)

The Spark session is defined first, followed by "sample_data" and "sample_columns". The DataFrame "dataframe" is then created from the sample data and sample columns. The distinct() function on the DataFrame returns a new DataFrame with duplicate records removed. The dropDuplicates() function is used to create "dataframe2", and the output is displayed using the show() function. Finally, dropDuplicates() is executed on the selected columns "department" and "salary", removing rows that share the same values in both of those columns.
