Explain Count Distinct from Dataframe in PySpark in Databricks

This recipe explains what Count Distinct from a DataFrame means in PySpark in Databricks.

Recipe Objective - Explain Count Distinct from a DataFrame in PySpark in Databricks

The distinct().count() method chain on a DataFrame and the countDistinct() SQL function are the two approaches popularly used in Apache Spark to get a distinct count. The distinct() transformation eliminates duplicate records (i.e., rows matching on all columns) from the DataFrame, and count() returns the number of records remaining in the DataFrame. Chaining the two therefore yields the distinct count of the PySpark DataFrame. The countDistinct() function is defined as a SQL function in PySpark and returns the count of distinct values over the selected columns.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains Count Distinct from a DataFrame and how to perform it in PySpark.

Implementing the Count Distinct from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

The SparkSession class and the countDistinct function are imported to demonstrate Count Distinct from a DataFrame in PySpark.

# Implementing the Count Distinct from DataFrame in Databricks in PySpark
spark = SparkSession.builder \
    .appName('Spark Count Distinct') \
    .getOrCreate()

Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["Name", "Dept", "Salary"]
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_columns)
dataframe.show()

# Using distinct().count() function
print("Distinct Count: " + str(dataframe.distinct().count()))

# Using countDistinct() function
dataframe2 = dataframe.select(countDistinct("Dept", "Salary"))
dataframe2.show()

The "dataframe" value is created from the defined Sample_data and Sample_columns. The distinct().count() chain returns the number of rows after duplicate rows are removed; since "Ram" appears twice with identical values, the distinct count is 9 rather than 10. The countDistinct() SQL function returns the count of distinct combinations of the selected columns, here Dept and Salary, of the dataframe.

