Explain Count Distinct from Dataframe in PySpark in Databricks

This recipe explains what Count Distinct from a DataFrame means in PySpark in Databricks.

Recipe Objective - Explain Count Distinct from a DataFrame in PySpark in Databricks

The distinct().count() method chain on a DataFrame and the countDistinct() SQL function are the two approaches popularly used in Apache Spark to get a distinct count. The distinct() transformation eliminates duplicate records (i.e., rows matching on all columns) from the DataFrame, and count() returns the number of records remaining in the DataFrame. Chaining the two therefore yields the distinct count of the PySpark DataFrame. The countDistinct() function is defined as a SQL function in PySpark and returns the count of distinct values over the selected columns.


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains Count Distinct from a DataFrame and how to perform it in PySpark.

Implementing the Count Distinct from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

The SparkSession class and the countDistinct function are imported to demonstrate Count Distinct from a DataFrame in PySpark.

# Implementing the Count Distinct from DataFrame in Databricks in PySpark
spark = SparkSession.builder \
    .appName('Spark Count Distinct') \
    .getOrCreate()

Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["Name", "Dept", "Salary"]
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_columns)
dataframe.show()

# Using distinct().count() function
print("Distinct Count: " + str(dataframe.distinct().count()))

# Using countDistinct() function
dataframe2 = dataframe.select(countDistinct("Dept", "Salary"))
dataframe2.show()

The "dataframe" value is created from the defined Sample_data and Sample_columns. The distinct().count() chain returns the number of rows after duplicate rows are removed; since "Ram" appears twice with identical values, the distinct count is 9 rather than 10. The countDistinct() SQL function returns the count of distinct combinations of the selected columns, here Dept and Salary, of the dataframe.

