Explain Count Distinct from DataFrame in PySpark in Databricks

This recipe explains what Count Distinct from a DataFrame in PySpark in Databricks is.

Recipe Objective - Explain Count Distinct from DataFrame in PySpark in Databricks

The distinct().count() chain on a DataFrame and the countDistinct() SQL function are the two common ways to get a distinct count in Apache Spark. distinct() eliminates duplicate records (rows that match on all columns) from the DataFrame, and count() returns the number of records in the DataFrame; chaining the two therefore yields the distinct count of the PySpark DataFrame. countDistinct() is a SQL function in PySpark that returns the distinct count over the selected columns.

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains Count Distinct from a DataFrame and how to perform it in PySpark.

Implementing the Count Distinct from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

The SparkSession and countDistinct imports are brought in to demonstrate Count Distinct from a DataFrame in PySpark.

# Implementing the Count Distinct from DataFrame in Databricks in PySpark
spark = SparkSession.builder \
    .appName('Spark Count Distinct') \
    .getOrCreate()
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["Name", "Dept", "Salary"]
dataframe = spark.createDataFrame(data = Sample_data, schema = Sample_columns)
dataframe.show()
# Using distinct().count() function
print("Distinct Count: " + str(dataframe.distinct().count()))
# Using countDistinct() function
dataframe2 = dataframe.select(countDistinct("Dept", "Salary"))
dataframe2.show()

The "dataframe" value is created from the Sample_data and Sample_columns defined above. The distinct().count() chain returns the number of rows remaining after duplicates are removed: here it returns 9, because the record ("Ram", "Technology", 4000) appears twice. The countDistinct() SQL function returns the distinct count over the selected columns, Dept and Salary, of the dataframe: here it returns 8 distinct (Dept, Salary) pairs.

