Explain Count Distinct from DataFrame in PySpark in Databricks

This recipe explains what Count Distinct from a DataFrame in PySpark in Databricks is.

Recipe Objective - Explain Count Distinct from DataFrame in PySpark in Databricks

The distinct().count() chain on a DataFrame and the countDistinct() SQL function are the two common ways to get a distinct count in Apache Spark. distinct() eliminates duplicate records (rows that match on all columns) from the DataFrame, and count() returns the number of records in the DataFrame; chaining the two therefore yields the distinct count of the PySpark DataFrame. countDistinct() is a SQL function in PySpark that returns the distinct count over the selected columns.

Learn Spark SQL for Relational Big Data Processing

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains Count Distinct from a DataFrame and how to perform it in PySpark.

Implementing the Count Distinct from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

The SparkSession and countDistinct imports are brought in to demonstrate Count Distinct from a DataFrame in PySpark.

# Implementing the Count Distinct from DataFrame in Databricks in PySpark
spark = SparkSession.builder \
    .appName('Spark Count Distinct') \
    .getOrCreate()
Sample_data = [("Ram", "Technology", 4000),
               ("Shyam", "Technology", 5600),
               ("Veer", "Technology", 5100),
               ("Renu", "Accounts", 4000),
               ("Ram", "Technology", 4000),
               ("Vijay", "Accounts", 4300),
               ("Shivani", "Accounts", 4900),
               ("Amit", "Sales", 4000),
               ("Anupam", "Sales", 3000),
               ("Anas", "Technology", 5100)
               ]
Sample_columns = ["Name", "Dept", "Salary"]
dataframe = spark.createDataFrame(data = Sample_data, schema = Sample_columns)
dataframe.show()
# Using distinct().count() function
print("Distinct Count: " + str(dataframe.distinct().count()))
# Using countDistinct() function
dataframe2 = dataframe.select(countDistinct("Dept", "Salary"))
dataframe2.show()

The "dataframe" value is created from the Sample_data and Sample_columns defined above. The distinct().count() chain returns the number of rows remaining after duplicates are removed: here it returns 9, because the record ("Ram", "Technology", 4000) appears twice. The countDistinct() SQL function returns the distinct count over the selected columns, Dept and Salary, of the dataframe: here it returns 8 distinct (Dept, Salary) pairs.

